Assessing the accuracy and quality of Wikipedia entries compared to popular online encyclopaedias/Section 5


5. Results

The following section presents the results of this study. The results are presented in two sub-sections, covering the quantitative and the qualitative analysis respectively. The findings will be discussed in relation to each other, both in the context of this study and in the context of previous work, in Section 6 (Discussion).

5.1 Quantitative Analysis

This section presents the findings following the quantitative analysis of the data from this study. The results of the quantitative analysis will be presented under the broad headings listed in Section 4.2 above.

Stage 1: Exploratory Data Analysis

The characteristics of the dimensions for assessing the quality of articles in the entire sample are presented in Table 5.1 and discussed in Section 3.4.1 above. The distributions of the dimensions are presented in Table 5.2. Only the dimensions of accuracy and style/ readability for the alternative encyclopaedia were found to be normally distributed. The remaining dimensions for both Wikipedia and the alternative encyclopaedia were not normally distributed.


Dimension | Minimum | Maximum | Mean | Std. Deviation
Wikipedia (n=64)
Accuracy | 0.00 | 5.00 | 3.87 | 1.04
References | 0.00 | 5.00 | 3.07 | 1.47
Style/ Readability | 0.00 | 4.29 | 3.04 | 0.93
Overall Judgment | 0.00 | 1.00 | 0.37 | 0.32
Overall Quality Score | 0.65 | 3.26 | 2.18 | 0.65
Alternative Encyclopaedia (n=64)
Accuracy | 0.00 | 5.00 | 3.43 | 1.00
References | 0.00 | 5.00 | 1.49 | 1.04
Style/ Readability | 0.00 | 5.00 | 3.47 | 0.99
Overall Judgment | 0.00 | 1.00 | 0.33 | 0.37
Overall Quality Score | 0.99 | 3.32 | 1.84 | 0.56

Table 5.1 Dimension Characteristics.
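
The summary columns of Table 5.1 (minimum, maximum, mean and standard deviation) can be sketched for a single dimension as follows. This is only an illustration: the function name and the ratings are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the report does not state which form was used.

```python
import statistics


def describe(scores):
    """Return minimum, maximum, mean and sample standard deviation."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        # Assumption: sample SD (n - 1 denominator); the report does not say.
        "sd": statistics.stdev(scores),
    }


# Hypothetical ratings on the 0-5 scale; these are NOT data from the study.
example_scores = [4.0, 3.5, 5.0, 2.5, 4.5]
summary = describe(example_scores)
print({name: round(value, 2) for name, value in summary.items()})
```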

Dimension | Kolmogorov-Smirnov Statistic | Kolmogorov-Smirnov Sig. | Shapiro-Wilk Statistic | Shapiro-Wilk Sig.
Wikipedia (n=64)
Accuracy | 0.16 | 0.00 | 0.89 | 0.00
References | 0.16 | 0.00 | 0.89 | 0.00
Style/ Readability | 0.14 | 0.00 | 0.93 | 0.00
Overall Judgment | 0.12 | 0.04 | 0.96 | 0.05
Overall Quality Score | 0.29 | 0.00 | 0.78 | 0.00
Alternative Encyclopaedia (n=64)
Accuracy | 0.08 | 0.20* | 0.97 | 0.13
References | 0.39 | 0.00 | 0.57 | 0.00
Style/ Readability | 0.08 | 0.20* | 0.97 | 0.13
Overall Judgment | 0.13 | 0.01 | 0.94 | 0.00
Overall Quality Score | 0.31 | 0.00 | 0.76 | 0.00

*p<0.05

Table 5.2 Dimension Distributions.
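
To make the Kolmogorov-Smirnov statistic in Table 5.2 more concrete, the sketch below computes the largest gap between a sample's empirical distribution and a normal distribution fitted to that sample. It is a minimal illustration under stated assumptions: the function name and both samples are hypothetical, the normal parameters are estimated from the data, and no significance value is computed (that step needs tables or a statistics package and is not reproduced here).

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(scores):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(scores)
    n = len(xs)
    # Assumption: the reference normal uses the sample mean and sample SD.
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical samples; these are NOT data from the study.
roughly_symmetric = [2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0, 4.5, 5.0]
heavily_skewed = [0.0] * 9 + [5.0]

d_symmetric = ks_statistic_normal(roughly_symmetric)
d_skewed = ks_statistic_normal(heavily_skewed)
print(round(d_symmetric, 2), round(d_skewed, 2))
```

A larger statistic indicates a greater departure from the fitted normal curve, which is the pattern Table 5.2 reports for most dimensions.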

The sample characteristics in each of the languages and academic disciplines are presented in Tables 5.3 and 5.4 respectively.

Dimension | Alternative Encyclopaedia Mean | Alternative Encyclopaedia SD | Wikipedia Mean | Wikipedia SD
English
Accuracy | 3.45 | 1.00 | 4.18 | 0.86
References | 1.30 | 0.67 | 3.82 | 1.21
Style/ Readability | 3.46 | 0.84 | 3.26 | 0.73
Overall Judgment | 1.76 | 0.57 | 2.38 | 0.57
Overall Quality Score | 0.30 | 0.45 | 0.32 | 0.29
Spanish
Accuracy | 3.46 | 0.90 | 4.00 | 0.95
References | 1.52 | 1.10 | 3.40 | 1.23
Style/ Readability | 3.56 | 0.92 | 3.11 | 0.93
Overall Judgment | 1.85 | 0.54 | 2.27 | 0.69
Overall Quality Score | 0.36 | 0.34 | 0.42 | 0.34
Arabic
Accuracy | 3.57 | 0.83 | 3.48 | 0.83
References | 1.81 | 1.29 | 1.72 | 0.98
Style/ Readability | 3.55 | 0.99 | 2.83 | 0.89
Overall Judgment | 1.92 | 0.61 | 1.75 | 0.50
Overall Quality Score | 0.34 | 0.30 | 0.38 | 0.34

Table 5.3 Sample characteristics according to language.

Dimension | Alternative Encyclopaedia Mean | Alternative Encyclopaedia SD | Wikipedia Mean | Wikipedia SD
Humanities
Accuracy | 3.85 | 0.57 | 4.30 | 0.66
References | 1.38 | 0.75 | 4.13 | 0.75
Style/ Readability | 3.96 | 0.63 | 3.43 | 0.35
Overall Judgment | 2.09 | 0.50 | 2.57 | 0.34
Overall Quality Score | 0.62 | 0.48 | 0.63 | 0.48
Social Sciences
Accuracy | 3.50 | 0.83 | 3.80 | 0.75
References | 1.55 | 1.25 | 3.24 | 1.12
Style/ Readability | 3.86 | 0.81 | 2.91 | 0.84
Overall Judgment | 2.01 | 0.57 | 2.18 | 0.59
Overall Quality Score | 0.58 | 0.34 | 0.47 | 0.39
Mathematics, Physics and Life Sciences
Accuracy | 4.28 | 0.48 | 4.24 | 0.99
References | 1.97 | 1.26 | 3.19 | 1.72
Style/ Readability | 3.58 | 0.89 | 3.42 | 0.78
Overall Judgment | 2.07 | 0.57 | 2.31 | 0.62
Overall Quality Score | 0.28 | 0.35 | 0.39 | 0.27
Medical Sciences
Accuracy | 2.77 | 0.70 | 3.71 | 1.00
References | 1.14 | 0.30 | 2.77 | 1.45
Style/ Readability | 3.10 | 0.89 | 2.99 | 0.97
Overall Judgment | 1.45 | 0.33 | 2.00 | 0.72
Overall Quality Score | 0.11 | 0.21 | 0.25 | 0.30

Table 5.4 Sample characteristics according to academic disciplines.

The sample characteristics of the entire sample categorised according to whether the articles were reviewed by student or academic experts are presented in Table 5.5.

Dimension | Student Experts Mean | Student Experts SD | Academic Experts Mean | Academic Experts SD
Accuracy | 3.64 | 0.94 | 3.74 | 0.94
References | 2.16 | 1.34 | 2.40 | 1.53
Style/ Readability | 3.15 | 1.00 | 3.80 | 0.85
Overall Judgment | 1.96 | 0.63 | 2.03 | 0.63
Overall Quality Score | 0.41 | 0.36 | 0.33 | 0.34

Table 5.5 Sample characteristics of entire sample according to nature of reviewer.

5.1.1 Overall Comparison across the Sample between Wikipedia Entries and Articles from the Alternative Encyclopaedias

The findings of the comparisons of reviewers' ratings of articles from Wikipedia and from the alternative encyclopaedias are presented in Table 5.6. Wikipedia articles were found to have scored significantly higher on the dimensions of accuracy, references, style/ readability and overall judgment (see Table 5.1).

Test Statistic P value
Accuracy U = 2645.00** 0.004
References U = 3285.00** <0.001
Style/ Readability U = 1533.00** 0.01
Overall Judgment U = 2638.50** 0.001
Overall Quality Score U = 2148.50 0.38

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.6 Comparison of article characteristics between Wikipedia and alternative encyclopaedias.
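
For readers unfamiliar with the Mann Whitney U statistic reported in Table 5.6 and the tables that follow, the sketch below shows one common way of obtaining it: count, over every pairing of a rating from one group with a rating from the other, how often the first exceeds the second, with ties counting as half. The function name and the ratings are hypothetical, and the report does not state which group's U value it tabulates, so the choice to report U for the first sample is an assumption. No p value is computed here.

```python
def mann_whitney_u(first, second):
    """U for `first`: pairs where first > second, with ties counted as 0.5."""
    u = 0.0
    for a in first:
        for b in second:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical ratings on the 0-5 scale; these are NOT data from the study.
wikipedia = [4.0, 3.5, 5.0, 4.5]
alternative = [3.0, 3.5, 2.5, 4.0]

u_wikipedia = mann_whitney_u(wikipedia, alternative)
u_alternative = mann_whitney_u(alternative, wikipedia)
print(u_wikipedia, u_alternative)
```

The two U values always sum to the number of pairings (here 4 x 4 = 16), so a value well above half that total indicates that the first group tended to receive the higher ratings.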

5.1.2 Comparison within each Language Group between Wikipedia Entries and Articles from the Alternative Encyclopaedias

The findings of the comparisons of reviewers' ratings of articles from Wikipedia and from the alternative encyclopaedias for English are presented in Table 5.7. Similar comparisons for Spanish and Arabic are presented in Tables 5.8 and 5.9 respectively.

In English, Wikipedia scored significantly higher on accuracy, references and overall judgment, as compared to the alternative encyclopaedia (Encyclopaedia Britannica) (see Tables 5.3 and 5.7). There were no differences between Wikipedia and Encyclopaedia Britannica on style and overall quality score.

Dimensions Test Statistic P value
Accuracy U = 349.00* 0.01
References U = 458.00** <0.001
Style/ Readability U = 222.00 0.64
Overall Judgment U = 368.50** 0.003
Overall Quality Score U = 272.50 0.43

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.7 Comparison of article characteristics between Wikipedia and alternative encyclopaedias in English.

In Spanish, Wikipedia scored significantly higher on accuracy, references and overall judgment as compared to the alternative encyclopaedia (Enciclonet) (see Tables 5.3 and 5.8). There were no differences between Wikipedia and Enciclonet on style and overall quality score.

Dimensions Test Statistic P value
Accuracy U = 459.00* 0.03
References U = 578.00** <0.001
Style/ Readability U = 260.00 0.15
Overall Judgment U = 440.00* 0.01
Overall Quality Score U = 342.00 0.53

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.8 Comparison of article characteristics between Wikipedia and alternative encyclopaedias in Spanish.

In Arabic, the alternative encyclopaedias (Mawsoah and ArabEncy) scored significantly higher on style than Wikipedia (see Tables 5.3 and 5.9). There were no differences between Wikipedia and either Mawsoah or ArabEncy on accuracy, references, overall judgment and overall quality score.

Dimensions Test Statistic P value
Accuracy U = 130.00 0.96
References U = 133.50 0.84
Style/ Readability U = 68.00* 0.02
Overall Judgment U = 113.00 0.59
Overall Quality Score U = 133.00 0.87

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.9 Comparison of article characteristics between Wikipedia and alternative encyclopaedias in Arabic.
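
The footnote convention used beneath the comparison tables (*p<0.05, **p<0.01) can be expressed as a small helper. This is an illustrative sketch only: the function name is hypothetical, and because the tables report rounded p values, a marker derived from a rounded figure such as 0.01 need not match the marker derived from the unrounded value.

```python
def significance_marker(p_value):
    """Return '**' for p < 0.01, '*' for p < 0.05, otherwise an empty string."""
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# Example p values of the kind reported in the tables.
markers = [significance_marker(p) for p in (0.004, 0.03, 0.38)]
print(markers)
```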

5.1.3 Comparison within each Academic Discipline between Wikipedia Entries and Articles from the Alternative Encyclopaedias

The findings of the comparisons of reviewers' ratings of articles from Wikipedia and from the alternative encyclopaedias for the Humanities are presented in Table 5.10. Similar comparisons for the Social Sciences, Mathematics, Physics and Life Sciences, and the Medical Sciences are presented in Tables 5.11, 5.12 and 5.13 respectively.

In the Humanities, Wikipedia scored significantly higher on references as compared to the alternative encyclopaedias (see Tables 5.4 and 5.10). There were no differences between Wikipedia and the alternative encyclopaedias on accuracy, style/ readability, overall judgment and overall quality score.

Dimensions Test Statistic P value
Accuracy U = 12.00 0.34
References U = 16.00* 0.03
Style/ Readability U = 3.00 0.20
Overall Judgment U = 12.50 0.20
Overall Quality Score U = 6.00 0.69

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.10 Comparison of article characteristics between Wikipedia and alternative encyclopaedias in the Humanities.

In the Social Sciences, Wikipedia scored significantly higher on references as compared to the alternative encyclopaedias, but the alternative encyclopaedias scored significantly higher on style/ readability as compared to Wikipedia (see Tables 5.4 and 5.11). There were no differences between Wikipedia and the alternative encyclopaedias on accuracy, overall judgment and overall quality score.

Dimensions Test Statistic P value
Accuracy U = 238.00 0.30
References U = 341.00** <0.001
Style/ Readability U = 95.00** <0.004
Overall Judgment U = 218.00 0.28
Overall Quality Score U = 153.00 0.44

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.11 Comparison of article characteristics between Wikipedia and alternative encyclopaedias in the Social Sciences.

In Mathematics, Physics and Life Sciences, Wikipedia scored significantly higher on references as compared to the alternative encyclopaedias (see Tables 5.4 and 5.12). There were no differences between Wikipedia and the alternative encyclopaedias on accuracy, style/ readability, overall judgment and overall quality score.

Dimensions Test Statistic P value
Accuracy U = 191.00 0.37
References U = 230.00* 0.03
Style/ Readability U = 130.00 0.32
Overall Judgment U = 206.00 0.17
Overall Quality Score U = 198.00 0.27

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.12 Comparison of article characteristics between Wikipedia and alternative encyclopaedias in Mathematics, Physics and Life Sciences.

In the Medical Sciences, Wikipedia scored significantly higher on accuracy, references and overall judgment as compared to the alternative encyclopaedias (see Tables 5.4 and 5.13). There were no differences between Wikipedia and the alternative encyclopaedias on style/ readability and overall quality score.

Dimensions Test Statistic P value
Accuracy U = 384.00** 0.001
References U = 409.00** <0.001
Style/ Readability U = 245.00 0.94
Overall Judgment U = 359.50** 0.006
Overall Quality Score U = 299.50 0.10

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.13 Comparison of article characteristics between Wikipedia and alternative encyclopaedias in the Medical Sciences.

5.1.4 Comparison between Wikipedia Entries and Articles from the Alternative Encyclopaedias per Cell i.e. per Language and Academic Discipline

The results of the intra-cell comparisons are presented in Table 5.14. It is difficult to interpret these findings without the raw data – a reporting of which is beyond the scope of this document. However, to summarise the findings, the far right column of the table contains the interpretation of the findings with reference to the database.

To summarise, in nine out of the ten cells, Wikipedia scored significantly higher than the alternative encyclopaedias on references. In one cell (English and Medical Sciences) Wikipedia scored significantly higher than the alternative (Encyclopaedia Britannica) on all dimensions. In another cell (Arabic and Mathematics, Physics and Life Sciences) the alternative scored significantly higher than Wikipedia on references, style and overall judgment.

Cell | Accuracy | References | Style | Overall Judgment | Overall Quality Score | Interpretations Based on Raw Data
English & Humanities | U=12.00, P=0.34 | U=16.00*, P=0.03 | U=3.00, P=0.20 | U=12.50, P=0.20 | U=6.00, P=0.69 | Wikipedia scored significantly higher than the alternative for references.
English & Social Sciences | U=6.00, P=0.69 | U=16.00*, P=0.03 | U=2.00, P=0.11 | U=8.00, P=1.00 | U=5.00, P=0.49 | Wikipedia scored significantly higher than the alternative for references.
English & MPLS | U=26.50, P=0.18 | U=35.5*, P=0.002 | U=15.00, P=0.70 | U=29.00, P=0.09 | U=19.00, P=1.00 | Wikipedia scored significantly higher than the alternative for references.
English & Medical Sciences | U=58.50**, P=0.69 | U=59.00**, P=0.03 | U=51.00*, P=0.11 | U=57.00**, P=1.00 | U=52.00*, P=0.49 | Wikipedia scored significantly higher than the alternative on all dimensions.
Spanish & Social Sciences | U=94.50, P=0.20 | U=112.50*, P=0.02 | U=40.00, P=0.07 | U=73.50, P=0.40 | U=52.50, P=0.61 | Wikipedia scored significantly higher than the alternative for references.
Spanish & MPLS | U=33.00*, P=0.02 | U=36.00**, P=0.002 | U=28.00, P=0.13 | U=36.00**, P=0.002 | U=33.00*, P=0.02 | Wikipedia scored significantly higher on references, overall judgment and overall quality score. The alternative scored higher on accuracy.
Spanish & Medical Sciences | U=41.00, P=0.38 | U=55.00*, P=0.02 | U=20.00, P=0.23 | U=38.00, P=0.57 | U=30.00, P=0.88 | Wikipedia scored significantly higher than the alternative for references.
Arabic & Social Sciences | U=12.50, P=0.20 | U=16.00*, P=0.03 | U=3.00, P=0.20 | U=15.00, P=0.06 | U=10.00, P=0.69 | Wikipedia scored significantly higher than the alternative for references.
Arabic & MPLS | U=7.50, P=0.09 | U=3.00*, P=0.02 | U=4.00*, P=0.03 | U=4.00*, P=0.03 | U=15.00, P=0.70 | The alternative scored significantly higher on references, style and overall judgment.
Arabic & Medical Sciences | U=29.00, P=0.09 | U=27.00, P=0.18 | U=16.50, P=0.82 | U=23.00, P=0.49 | U=21.00, P=0.70 |

*p<0.05, **p<0.01. U = Mann Whitney U test statistic. MPLS = Mathematics, Physics and Life Sciences.

Table 5.14 Intra-cell comparisons of articles between Wikipedia and alternative encyclopaedias.

5.1.5 Inter-Reviewer Comparisons

There were no differences in the scoring of either Wikipedia articles or articles from the alternative encyclopaedias based on whether the articles were reviewed by students or academic experts. These results are presented in Table 5.15.

Dimensions Wikipedia Alternative Encyclopaedia
Accuracy U=437.50, p=0.97 U=469.00, p=0.67
References U=441.00, p=0.99 U=527.00, p=0.13
Style/ Readability U=484.50, p=0.52 U=483.00, p=0.53
Overall Judgment U=370.00, p=0.84 U=378.50, p=0.35
Overall Quality Score U=443.50, p=0.32 U=493.50, p=0.41

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.15 Comparisons in ratings of articles (Wikipedia and alternative encyclopaedias) based on whether scored by student or academic experts.

The results of the inter-reviewer comparisons categorised according to language are presented in Table 5.16 below. Spanish academics scored articles significantly higher for style/ readability and overall judgment as compared to students. Overall quality scores were significantly higher among students native in English, compared to academics native in English, although no significant differences were detected on any of the other four dimensions. A similar result was found for overall quality scores among Arabic reviewers.

The results of the inter-reviewer comparisons categorised according to academic disciplines are presented in Table 5.17 below. There were no significant differences between the ratings of articles by students and academic experts in the Humanities, Social Sciences, Mathematics, Physics and Life Sciences, and the Medical Sciences.

Student Mean (SD) Academic Mean (SD) Statistical Test p value
English
Accuracy 3.32 (1.07) 3.92 (0.84) U = 363.50 p = 0.13
References 1.97 (1.22) 2.69 (1.58) U = 350.00 p = 0.21
Style/ Readability 2.75 (0.97) 3.61 (0.82) U = 408.00* p = 0.02
Overall Judgement 1.72 (0.61) 2.22 (0.61) U = 391.00* p = 0.01
Overall Quality Score 0.28 (0.31) 0.44 (0.34) U = 340.00 p = 0.12
Spanish
Accuracy 3.38 (0.36) 3.58 (0.93) U = 117.50 p = 0.36
References 1.63 (0.92) 1.81 (1.21) U = 102.00 p = 0.82
Style/ Readability 3.24 (1.08) 3.17 (0.99) U = 91.00 p = 0.85
Overall Judgement 1.77 (0.39) 1.86 (0.60) U = 104.00 p = 0.75
Overall Quality Score 0.31 (0.26) 0.38 (0.34) U = 103.50 p = 0.75
Arabic
Accuracy 4.10 (0.85) 3.66 (1.05) U = 165.50 p = 0.15
References 2.63 (1.61) 2.52 (1.62) U = 214.50 p = 0.81
Style/ Readability 3.50 (0.90) 3.28 (0.71) U = 182.50 p = 0.31
Overall Judgement 2.28 (0.63) 1.95 (0.64) U = 155.50* p = 0.10
Overall Quality Score 0.59 (0.37) 0.14 (0.27) U = 83.50* p <0.001

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.16 Comparisons in ratings of articles (Wikipedia and alternative encyclopaedias) by student or academic experts, categorised according to language.

Student Mean (SD) Academic Mean (SD) Statistical Test p value
Humanities
Accuracy 3.85 (0.57) 4.30 (0.66) U = 7.00 p = 0.89
References 1.38 (0.75) 4.13 (0.75) U = 8.00 p = 1.00
Style/ Readability 3.96 (0.63) 3.43 (0.35) U = 7.00 p = 0.89
Overall Judgement 2.09 (0.50) 2.57 (0.34) U = 6.50 p = 0.69
Overall Quality Score 0.63 (0.48) 0.50 (0.19) U = 6.50 p = 0.69
Social Sciences
Accuracy 3.42 (0.86) 3.82 (0.72) U = 230.00 p = 0.29
References 1.86 (1.02) 2.77 (1.61) U = 232.00 p = 0.26
Style/ Readability 3.29 (1.08) 3.45 (0.85) U = 188.50 p = 0.92
Overall Judgement 1.92 (0.47) 2.23 (0.63) U = 226.50 p = 0.14
Overall Quality Score 0.50 (0.32) 0.54 (0.41) U = 189.00 p = 0.72
Mathematics, Physics & Life Sciences
Accuracy 4.40 (0.45) 4.22 (0.84) U = 106.00 p = 0.84
References 2.94 (1.47) 2.48 (1.66) U = 94.50 p = 0.51
Style/ Readability 3.27 (1.09) 3.51 (0.76) U = 125.00 p = 0.64
Overall Judgement 2.35 (0.69) 2.15 (0.58) U = 92.00 p = 0.47
Overall Quality Score 0.56 (0.42) 0.27 (0.25) U = 65.50 p = 0.08
Medical Sciences
Accuracy 3.30 (1.07) 3.22 (0.96) U = 173.50 p = 0.62
References 1.83 (1.39) 2.00 (1.33) U = 218.00 p = 0.62
Style/ Readability 2.70 (0.89) 3.17 (0.92) U = 248.50 p = 0.14
Overall Judgement 1.62 (0.66) 1.76 (0.61) U = 220.00 p = 0.46
Overall Quality Score 0.13 (0.23) 0.20 (0.28) U = 217.50 p = 0.42

*p<0.05, **p<0.01. U = Mann Whitney U test statistic.

Table 5.17 Comparisons in ratings of articles (Wikipedia and alternative encyclopaedias) by student or academic experts, categorised according to academic discipline.

5.2 Qualitative Findings

This section of the report summarises and discusses findings from the qualitative element of the research, in terms of the perceptions, opinions and judgments of the expert reviewers regarding the articles from Wikipedia, and other online encyclopaedias. In Section 3, we explained how reviewers – both professional academics, and graduate students – were asked to comment on the quality, accuracy, citability and style of a few articles each, in their own fields of expertise. As was shown in that section, we paired Wikipedia articles with similar ones from the following sources: online Encyclopaedia Britannica (English articles), Enciclonet (Spanish), Mawsoah and Arab Encyclopaedia (Arabic), removing all evidence of the source of each article. We asked reviewers to comment on a range of quality criteria, summarised as accuracy, incorporating validity, completeness, relevance, neutrality/ bias, currency; use of references; style/ readability incorporating conciseness, language, spelling and grammar, coherence, use of illustrative material, and enjoyment. Having commented on each of these aspects for each paper separately, reviewers were asked to compare the two ('Please use the space below to make any additional comments about the two articles in comparison with each other').

5.2.1 Academics' Qualitative Judgments

In this section we shall look first of all, in section 5.2.1.1, at how the academics in this study tended to make judgments, both positive and negative, about the full range of online encyclopaedia articles in the sample. Then we shall look specifically in 5.2.1.2 at the question of whether it is possible to identify strengths and weaknesses that are characteristic of Wikipedia articles in particular. Academics were simply told that the articles given to them for review 'have been carefully chosen from popular online encyclopaedias to overlap with your area of academic expertise', and were urged not to attempt to identify the origins of articles. Therefore we aim to detect whether judgments, made blind as to the identity of different articles, revealed characteristic patterns regarding Wikipedia articles.

Whilst different reviewers sometimes expressed contradictory opinions regarding the same article (which will be discussed further in Section 6), initial analysis has indicated that there was no marked pattern of differences of opinion between student reviewers and more established academic reviewers. For that reason, we include responses from both in the following sections of this report. We do, though, in the interests of transparency, indicate throughout the qualitative data whether a comment came from a student, an established academic or a professor. It should be noted that the term 'student' covered masters students, research students reading for a variety of research degrees including doctorates, and postdoctoral students. And again, we specify these distinctions when quoting comments from reviewers.

5.2.1.1 Academics' Judgments about Online Encyclopaedia Articles in General

In this sub-section, we try to capture the aspects of articles that earned either the approval or disapproval of the academic reviewers across the full sample of articles selected. The aspects praised and criticised below are fairly evenly distributed across all sources, and are intended to illustrate the kinds of judgments about online encyclopaedia articles across the sample that were generated by the experience of comparing and reviewing one or more pairs of articles on specific topics. So although we identify the sources of the articles in the following examples, again in the interest of transparency, we are not suggesting that the characteristics discussed in this sub-section are uniquely typical of that source.

It was evident on a number of occasions that academics, having considered the criteria put before them to help them with their quality judgments, and scored each of these criteria individually, very often went on to consider the combined characteristics of an article, balancing and synthesising the individual elements to arrive at an overall judgment. This is evident for example in the responses of two different reviewers for the pair of articles on Evo Morales, in which it appears that the overall feel and coherence of the article were valued more highly than quantity or currency of information:

  • "The second one [Enciclonet] is much better than the first one [Wikipedia]. It is shorter but contains all the important information up to 2006. It also has a better and more professional style." (Reviewer 3 – academic)
  • "Speaking as an academic, I much preferred the second piece [Enciclonet] in this case. However, this was on principally intellectual and aesthetic grounds. For the most up to date information, and for more information about the various critics of Morales' governance, I would have to consult the first piece [Wikipedia] since this is absent in the second." (Reviewer 2 – academic)

For a number of academics, the impression of the article as a cohesive piece of writing appeared to be valued at least as much as the extent of subject matter, as can be seen in this comment on the pair of articles on Energia Renovable:

  • "The first article [Wikipedia] is more suitable and better descriptive of the subject matter. However, article 2 [Enciclonet] is half its size and can be read more quickly, which poses distinct advantages. Both are well written and accurate." (Reviewer 1 – Professor)

Even when lack of comprehensiveness (specifically here, lack of currency) is acknowledged as undermining its usefulness, the overall feel of an article still earns it some degree of approval, even if not actually rendering it preferable to the article that is more up to date:

  • "The first article [Wikipedia] had more information, but the second [Mawsoah] was much more eloquent." (Reviewer 1 – research student – on articles on Egypt)

By the same measure, when one article abandoned any attempt at providing an engaging narrative, it was viewed quite negatively:

  • "There is not a lot of writing in this article [Wikipedia]; mostly there are series of factual information in bullet point format." (Reviewer 2 – academic – Primary Education)

Criticisms of this kind were made about both Wikipedia and non-Wikipedia articles, and are quoted here in order to capture the impression emerging from the data that academics – while of course strongly concerned with accuracy, currency and comprehensiveness (also demonstrated in the quantitative results) – also judge articles of this kind for their capacity to bring a topic alive to the non-expert or casual reader. This theme runs through the comments from all three reviewers on the pair of articles about Polonomia:

  • "The second article [Wikipedia] is much clearer and concise, though it could need some additional information. The first article [Enciclonet], on the other hand, lacks focus and is rather inconsistent." (Reviewer 3 – academic)
  • "The first article is too confusing, poorly written and makes emphasis on one aspect of the theory. The second one is better written, gives a good overview, but does not cover in depth any topic. The scope of the first is larger but the execution is very poor. The quality of the second is higher, but it is too short." (Reviewer 2 – postdoc)

Certain tendencies, which to some extent crossed both language and disciplinary boundaries, are apparent here. Concision (which above all seems to have been taken to mean something along the lines of getting straight to the point) is valued a great deal by many reviewers (especially with respect to scientific articles) as is good writing – in terms of having a clear and informative tone of voice throughout:

  • "Article 1 [Enciclonet] goes deeper than the article 2 [Wikipedia]. In particular, notions of polynomial in higher algebraic settings are discussed and more theoretical results are cited. Also, it does a better job when discussing 'factorización de polinomios' (polynomial factorisation). On the other hand, article 2 [Wikipedia] is way better written than article 1. It has the right encyclopaedic tone, including a very good introduction. Also, article 2 does the very important task of pointing to applications of the subject at hand." (Reviewer 1 – academic)

The following comment picks this up very clearly and reiterates the emphasis from many reviewers that, for an encyclopaedia article to be 'well-written', it must crucially be accessible to the kind of readership presumed to seek out online encyclopaedia articles:

  • "This article [Britannica] is factually correct and gives some interesting historical background information on antibiotic resistance. The article is written in a style that is simple, and this article should be accessible to both specialist and non-specialist readers. The article avoids an excessive use of technical jargon, and instead focuses on 'real' world examples of antibiotic resistance. Good, logical structure." (Reviewer 2 – academic – Antibiotic Resistance)

The same reviewer, discussing the Wikipedia article on the same topic, in fact recognised that this provided richer content:

  • "The second article [Wikipedia] provided much more detailed information on antibiotics and resistance, including very good citations to the scientific literature. However, the [...] article lacked organisation and structure." (Reviewer 2 – academic – Antibiotic Resistance)

Thus, values such as simplicity, accessibility, lack of jargon and good structure are very often emphasised alongside values more immediately associated with encyclopaedias, such as accuracy, comprehensiveness and currency of information. It seemed that the academics reviewing these articles were generally willing to accept certain deficiencies in online encyclopaedia articles, so long as they combined some degree of accuracy, currency and scope with an account that brought a subject to life for newcomers to an area. In that respect, substantial content was no substitute for an underlying dynamic or coherence in an article's account of the topic, the lack of which is clearly considered by these reviewers to be a problem with respect to each of the following articles:

  • "It [Wikipedia] is a very shallow article [...] there are many terms used. I did not find any reference for, the only reference used is inappropriate [...] there is no biased info in the article, because there is no controversy in the article, it is just explanation of the medical terms [...] there is no coherence in the article because it is just stating terms and jumping from one term to another without any connection." (Reviewer 1 – academic – on Pharmacokinetics)
  • "The second article [Britannica] is way too long; it is not good enough to warrant such a long piece." (Reviewer 2 – academic – on Memory)

In terms of articles judged as very poor, such as the two above, the factors that led to harsh judgments appeared to be common to all sources, and it is hard to locate anything specific to any encyclopaedia in such judgments. Strongly negative reviews of articles generally consisted of an accumulation of weak points in terms of accuracy, missing information, weak structure and lack of clarity, unredeemed by any strong impression of usability or readability:

  • "The text is too short and I don't think it is concise [...] The information is not well structured. In certain sentences we don't understand what the author is focused on [...] the use of the example is not in coherence with what the author intended to explain [...] It would have been useful to include a history section, and mention further topics such as interaction with logic and foundations of Mathematics [...] While most key ideas are included, no pointers are provided for the topics mentioned in the article, nor are there examples for them." (Combination of comments from all three reviewers for Wikipedia article on Mathematical Proof)
  • "The article is poorly written. Its language is stiff and it has a number of errors [...] The all-important relation of rational numbers ('números rationales') with real numbers is omitted completely. Indeed, there is no mention of real numbers at all [...] The grammatical and factual errors and the dull tone preclude the article from being useful in this regard [...] The article is repetitive and extremely boring. Moreover, it is pretentious to spend seven pages discussing such basic ideas [...] Would just confuse a non-academic reader. Too elementary [...] The article needs to be re-written, possibly from scratch. It has absolutely no value." (Combination of comments from all three reviewers for Enciclonet article on Numero Racional)

Overall, quite a small number of such articles were identified, whether from Wikipedia or other sources. In addition to the article on Mathematical Proof quoted above, largely negative judgments about Wikipedia articles applied to just a small number out of the total of 22 Wikipedia articles: Pharmacokinetics, Percepción (from the Spanish version), Primary Education and (according to just one of the two reviewers) St Thomas Aquinas. A certain number of negative comments were made about most articles, because the academic reviewers were generally rigorous in their judgments, but these were usually balanced or redeemed by a very fair identification of strengths. In the next section we focus mainly on Wikipedia articles in order to explore the balance of qualities that was identified in these specifically, and to see whether it is possible to detect a mix of strengths and weaknesses that is particularly relevant to Wikipedia.

5.2.1.2 Academics' Judgments about Wikipedia Articles in Particular

When it comes to articles that were judged as being satisfactory or good, which is to say articles that are readable and provide a useful point of reference or a good introduction to a topic, we argue that it was indeed possible to detect a pattern of qualities that was particularly characteristic of the Wikipedia articles, within this sample at least. Not all of these qualities are wholly positive if taken on their own, but together they constitute a set of characteristics which in combination outweigh specific weaknesses. The Wikipedia article on Hugo Chavez, for instance, illustrates this particular combination of qualities: "Generally speaking, the second article [Wikipedia] was much stronger than the first. It was far more comprehensive and detailed, it was up to date (going right up to the middle of last year rather than more or less stopping in 2005), and it offered a far more politically neutral interpretation of the subject." Both reviewers agreed that the Wikipedia article was the stronger on this particular topic, despite the fact that it did possess certain weaknesses:

  • "The only areas it fell down on were length (the second being three times as long) and [...] lacking a clear argument or unifying perspective about the subject, making it a little harder to isolate what the key issues for debate might be." (Reviewer 1 – academic)

In general, it appears that Wikipedia articles were distinguishable from those of other online encyclopaedias in the qualitative judgments with respect to the following characteristics, if combined within a particular article: good coverage of topic, currency, quality of referencing, along with the less desirable qualities of redundancy and repetition. We would add to this also the non-appearance of a particular area of potential criticism – that of bias, as this did not appear to be an issue on more than a very few occasions in judgments made about Wikipedia articles.

Coverage of topic

  • "The second article [Mawsoah] can be part of the first one [Wikipedia] [...] the first article is comprehensive while the second one is just an introduction, so we can use some information in the second article which is missing in the first one, like the addictors story to complete the first one." (Reviewer 3 – academic – Parkinson's)

In the same spirit, all three reviewers for the article on Memory felt that, despite "minor flaws" (Reviewer 2), the Wikipedia article was superior especially with respect to its coverage of the topic:

  • "The first article [Wikipedia] is decent. It is reasonably concise, and covers most things that I would include – certainly it is not perfect, and there are things missing, but it is concise and well-written. By contrast the second article [Britannica] is very vague and makes minimal links to the actual original science behind the points [...] I actually think that it would be a little misleading to a novice, because the literature has developed so much in the last 10-15 years." (Reviewer 3 – doctoral student)

Currency

Comprehensive coverage tended to be taken to imply currency as well, and this was certainly an area where Wikipedia articles consistently scored higher than others in the qualitative judgements:

  • "I think that the real strength of this article is that it gives people a good overview of what Attention actually is. It covers the historical background of the research area, but also more up to date perspectives. It is also very transparent about the overlap between attention and other related areas of study, such as Working Memory and Executive Processes [...]"

Indeed, this article earns particularly high praise from this reviewer specifically because of its currency:

  • "Everything in this article [Wikipedia] is stuff that I would have included had I written it myself. It is also very 'current' – all the stuff on disorders of attention in children is really very new [...] Everything that is stated as fact is pretty much accepted by the majority of the literature. I cannot really see any particular perspective coming through here. It is actually very carefully written." (Reviewer 1 – academic – Attention)

This judgment contrasts sharply with the comments from the same reviewer on the Britannica article on the same topic, which was described as "all very out-of-date, and therefore would be of no use in a current research article". In fact, it was consistently striking throughout the sample, that Wikipedia articles were nearly all considered to be more up to date than others, although being more up to date was not always a sufficient reason for a Wikipedia article to be considered the better of the two, if it failed on a combination of other factors such as clarity, cohesion or accuracy.

Referencing

The same is also true with respect to the presence of references. Wikipedia articles generally earned approval for their references, although – as we indicate below – these were not invariably judged to be advantageous. Wikipedia articles were clearly acknowledged as being more extensively referenced than others, even where the Wikipedia article was in some respects not considered to provide as much information as the other article:

  • "The second article [Wikipedia] gives a clear idea of the nature of climate change and its science but doesn't give as much detail on its impacts as the first one. Overall, the second article is better structured, organised and referenced," (Reviewer 2 – doctoral student – Cambio Climatico)

When the references are considered to be appropriate, this would generally earn the highest praise from reviewers, as in the following comment from the other reviewer for this article: "References are broad, valid and of the highest quality available." (Reviewer 1 – professor – Cambio Climatico). Similarly, it is the scholarly nature of its writing, supported by appropriate references, that earns Wikipedia higher praise in the following two separate instances, even though it is by implication to some extent insufficiently comprehensive if taken alone:

  • "I preferred the 1st article [Wikipedia] to the second. It is written in a more scholarly manner and it provides a lot of references. I found the second paper still a draft, and this might be the case. Ideally you would combine the two to give a more comprehensive picture of preschool education." (Reviewer 2 – academic – Preschool Education)
  • "The works cited are all of high-scholarly quality." (Reviewer 1 – research student – Anselm of Canterbury)

The mere existence of references did not, though, necessarily earn approval, as all three reviewers make quite clear with respect to the Wikipedia article on Parkinson's:

  • "The references cited are all from internet and these references could be changed or removed from the internet [...] I prefer the use of medical data from published text books in the right way." (Reviewer 1 – academic)
  • "Referencing was poor throughout the article." (Reviewer 2 – academic)
  • "References used are internet websites, no journals or books are used." (Reviewer 3 – student)

A similar point is made about the Wikipedia article on Mutation:

  • "Many of the references are from websites, magazines, or other popular media, and not from primary source scientific articles." (Reviewer 2 – academic)

But the same reviewer pointed out, in discussing the fact that the Britannica article "mentions the mutation rate of HIV, but doesn't cite any HIV related material", that anyway, "much of the article is quite basic and does not necessarily need intensive citation." Thus the mere presence of references is not inevitably viewed as an advantage if (a) the references are of a generally low level and (b) the overall article appears to aspire to be simply a good basic introduction to a topic.

This reflects a more general feeling that a good article needs to balance its elements throughout:

  • "All the references are published by recognized journals or are books written by academics that work in the topic. [...] The article [Wikipedia] is concise and focuses on its topic. All the information provided is relevant and necessary [...] provided in a well-structured form." (Reviewer 4 – academic – Neurona)

    but

    "The use of technical terms is not accompanied by an explanation [...] it would be necessary to add new information to complete the map. [...] I would probably eliminate the topic about artificial neural networks." (Reviewer 4 – academic – Neurona)

For most reviewers, though, lack of references was sometimes seen as a negative feature, regardless of other qualities, as the same reviewer makes clear with respect to the Enciclonet article on Neurona: "It is a great article, well-structured, clear, and easy to understand and read. The information provided is precise and complete. However, no references are provided and no topics are treated in depth." (Reviewer 4 – academic)

Redundancy and Repetition

The following three comments on Wikipedia articles represent what was quite a common theme from many reviewers of Wikipedia in particular, which is to say up to date content with good coverage of issues, but at the same time a tendency to repetition and redundancy of content:

  • "A lot of information, including a very thorough account of the events of Anselm's life. Mentions his most important ideas and works and discusses them reasonably well. Cites respectable scholarly sources, for the most part. Doesn't read completely smoothly, a bit repetitive at times. There are some digressions and random sentences that harm the overall coherence." (Reviewer 1 – academic – St Anselm)

While being a bit repetitive does not necessarily constitute a serious problem, simply repeating the same information at some length is clearly seen as a potential source of distraction or potential irritation to the reader:

  • "Within different sections, the article is [...] generally well structured, although there are some cases where information is repeated in multiple sections. For example, information on the applications of antibiotics in genetic engineering is given at the end of the article and in the section on mechanisms of antibiotic resistance; within the section on mechanisms of antibiotic resistance, general information is provided on mechanisms of antibiotic resistance followed by very specific information on mechanisms of resistance to one class of antibiotics (fluoroquinolones)." (Reviewer 1 – academic – Antibiotic resistance)

This reviewer of the Wikipedia article on Evo Morales makes an important general point, one which does appear to have been made considerably more often with respect to Wikipedia:

  • "It feels a little disaggregated at times [...] The piece is fine but not exceptional in terms over overall coherence. It doesn't read as if someone has thought about the whole text as a reading experience, whether the author or editor. So it's fine if you're delving in to get a particular fact, but it doesn't work amazingly well as a singular read." (Reviewer 2 – academic – Evo Morales)

The notion of the 'singular read' frequently surfaces in one way or another in article reviews, as indicated in the previous sub-section of this report, and is one that should clearly be considered seriously, in terms of its impact upon reading experience. More serious still, perhaps, is the suggestion from both the other reviewers for the same article that internal inconsistencies had resulted in actual contradictions in the content:

  • "I think that there is a very marked shift between the first 'biographical' part of the article, and the part that begins with the 2005 election victory of Evo. The first part is rather 'Evo-friendly' and relies almost exclusively on direct quotes from Evo. The second part of the article is rather more 'anti-Evo' and relies on newspaper references. It almost seems at times that it was written by two different people." (Reviewer 1 – research student – Evo Morales)
  • "Moves from statements that are too favourable to Morales to some statements that are too critical without enough support. Weak sourcing and bibliography. Would not cite in non-academic piece as not rigorous or well-organised enough." (Reviewer 3 – academic – Evo Morales)

However, there were in fact very few indications of any significant degree of internal contradiction identified in the broad sample of Wikipedia articles. The problem identified here concerns a lower level, but nonetheless important, issue of a lack of consistency and cohesion arising from the multi-authorship of articles.

Neutrality/ Bias

The issue of bias did not often appear to arise from this sample as a major threat to the quality of articles from Wikipedia, or indeed any other sources. It was generally referred to because the review process asked for a judgment on the question of bias, and reviewers were certainly careful to pay it due attention, even if occasionally they made it clear that the topic was not one, anyway, where issues of bias were likely to arise: "Not a very controversial topic" (Reviewer 3 – doctoral student – Mutation). Some references to bias suggest a slight modulation in the use of the term, such as in the review of Numero Racional, where the second reviewer considered that the Enciclonet article was "biased towards a formalist and algebraic point of view". In this instance at least it seems that ‘bias’ is used to describe insufficient scope in the discussion of a particular topic, rather than deliberately preferential treatment for one particular point of view.

One of the few suggestions that a Wikipedia article showed bias of any kind came in one reviewer's comments on the Energia Renovable article:

  • "There is a tendency to disregard nuclear energy, particularly fusion, which is a flawed view commonly supported by green energy advocates." (Reviewer 1 – Professor)

It was, in fact, more often the case that Wikipedia articles were credited with a distinct lack of bias, even with regard to topics where the risks of favouring one particular viewpoint (i.e. with respect to historical or political issues) might be considered to be quite high:

  • "The second article [Wikipedia] was much more up to date, although not as much as it should have been. Both articles need to be updated – the first more so than the second. Given the ongoing political changes that are currently taking place in the Middle East, it is crucial to update these articles to reflect such pressing issues. Both articles were concise and fairly eloquent. The first one had more information, especially general information regarding climate and topography, etc. The second one focused almost entirely on political and economic issues in the Middle East. Nevertheless, the first one [Mawsoah] was much more biased and had a political tone to it, while the second one addressed such political issues from a seemingly objective point of view." (Reviewer 1 – research student – Middle East)
  • "a very controversial figure. I don’t really see how one could be much more neutral." (Reviewer 1 – professor – Hugo Chavez)

5.2.3 Qualitative Judgments related to English, Spanish and Arabic Encyclopaedia Entries

As mentioned in Section 3, owing to the challenges of securing reviewers to participate in the study within the timescales, not all articles were evaluated by the same number of reviewers. Similarly, numbers of reviewers taking part in the study varied by language. There were eight reviewers for the Arab articles, eleven for the English articles and fourteen for the Spanish articles.

Additionally, owing to the difficulties of identifying a publication with a sufficiently wide spread of articles, two separate Arabic publications were used: Mawsoah and the Arab Encyclopaedia. Of those publications, the two articles taken from the Arab Encyclopaedia (on Algorithm and Mathematical proof) received very positive judgements. Each of these articles was evaluated by three reviewers, whereas two of the four articles compared to Mawsoah (which were generally less well received) were only evaluated by two reviewers. Therefore, whilst the Arab reviewers tended to be more critical of Wikipedia articles than English or Spanish reviewers, such a count will not necessarily give an accurate picture.

Across subject domains, there are similarities in the criteria employed by reviewers across all three languages. For example, in reviews of potentially controversial social science subjects which are subject to change over time, both Arabic and Spanish reviewers used wording relating to completeness, neutrality and currency in their assessment:

  • "The first article [Wikipedia] was much more up to date and discussed issues that occurred in 2011, while the second did not discuss anything that occurred in the 2000s, which therefore made it not very useful [...] When it came to political issues, neither article was sufficiently critical." (Reviewer 1 – research student – Egypt)

There was, though, some variation across the Arabic, Spanish and English examples with respect to opinions about the quality of language, where adherence to traditional notions of language use appeared to be judged more critically by both Spanish and Arabic reviewers than was the case with English language articles:

  • "Use of Spanish is alternating between Latin American Spanish and Spain Spanish." (Reviewer 1 – professor – Energia Renovable)
  • "The article is not well written; the organisation is poor, the verb tenses inconsistent and there are sections misplaced. Just to illustrate [...] what does a 'relación sentimental e ideológica' mean? Ideological is normally not used to refer to a relationship." (Reviewer 3 – research student – Hugo Chavez)
  • "[...] weak and poor Arabic language that can't be understood." (Reviewer 3 – student – Pharmacokinetics)

Overall, our analysis suggests that Wikipedia articles were judged favourably more often than not (in some cases just marginally, in others quite markedly) compared with articles from Britannica, Enciclonet and Mawsoah, but this was not the case when compared with Arab Encyclopaedia, whose articles were more often judged favourably than Wikipedia articles.

5.2.4 Qualitative Judgments Related to Different Disciplinary Areas

The disciplinary divisions selected for the study reflected those used to structure subjects at the University of Oxford, where the study took place. As a result, the project divided up academic disciplines according to the four main disciplinary 'divisions' of the University: Humanities; Social Sciences; Mathematics, Physics and Life Sciences; Medical Sciences. In reviewing the results of this study, we have found, though, that these disciplinary divisions, whilst helpful in logistical terms, did not constitute sufficiently clear disciplinary distinctions to be useful for analysis purposes. Therefore, we focused our attention in reviewing the qualitative data on disciplinary difference across two broad categories: 1) Humanities and Social Sciences, 2) Mathematics, Science and Medicine.

For the most part, reviewers in all subject areas worked well enough with the range of criteria against which they were asked to judge articles. In analysing their final comparative comments about articles, it is possible perhaps to detect distinct tendencies to prefer slightly different values in the humanities and social science article reviews, from those of mathematics, science and medicine.

For instance, terms introduced by reviewers into their discussion of history and social science articles included 'polished', 'eloquent', 'aesthetic', 'scholarly' and 'coherent'. By contrast, the key notions valued by reviewers of mathematics and science articles seemed especially to be those of scientific thinking, clarity and – above all – conciseness:

  • "The difference between the articles is very stark. The first is very waffly and never really gets on to the actual substance of what attention is / how it works. By contrast, the second article is concise and yet covers the important main points." (Reviewer 1 – academic – Attention)

Such fairly predictable and minor distinctions aside, though, it is not possible to add in any significant ways to the detailed analysis of the quantitative data concerning academic discipline variation as reported in 5.1 of this report.