Wikisource talk:Monthly Challenge



Monthly Challenge

2021

  • Each month, the Wikisource community selects a few texts to proofread and validate.
  • The texts are featured for a maximum of three months with a few exceptions.
  • The challenge builds Wikisource's core collection and helps introduce new users to Wikisource.
  • Wikisource seeks to make free, scan-backed ebooks accessible to everyone.

Completion and rolling over of volumes of which only part is in the MC

In MC 2021-05, we have at least two works which are backed by scans that include lots of other matter: The Time Machine and Dorian Gray. Do we consider those works "done" for the purposes of the MC when the relevant sections are complete?

Obviously it would be nice to get the whole volume done, but, certainly for Lippincott's 46, that's a lot of material (800 dense pages) that is probably not going to get done, since it's not particularly "special". So it might make sense to allow such works to be shunted off once the relevant sections are validated?

If so, I might also add a field to the data table to allow us to forcibly mark an index "proofread" or "validated", even if not all pages are complete. Inductiveloadtalk/contribs 18:14, 14 May 2021 (UTC)

@Inductiveload: I've been thinking about this quite a bit. For periodicals, I think it makes more sense to proofread them in parts. While it would be great to have the entire Lippincott's 46, much of that volume will attract little attention. Proofreading in bits will also make it easier to create scan-backed editions of major novels that were serialized in periodical form.
BTW, for Dorian Gray, the author is not being set in the epub export.
For next month, I'm thinking about replacing "Dorian Gray" with "The Sign of the Four" from Lippincott's 45. Agree?
For "The Time Machine", the entire Volume 1 should be proofread because all of it is a unique edition of H.G. Wells that is worth keeping.
In general, I'm wondering if it makes that much sense to keep the texts in the MC after they've been proofread. Yes, validation is great, but it takes away from proofreading. Would we like to have one text that is 99.9% accurate or two texts that are 98% accurate? Scan-backing means that a text can be corrected at any point in the future and that might be a better approach to going through the Oldies backlog.
What would you think of this proposal?
  1. Remove all proofread and fully transcluded texts at the end of the Month.
  2. For periodicals, each Monthly Challenge must make clear which section(s) of the periodical should be proofread. Once those sections have been completed, the periodical will be removed from the MC at the end of the month. If the MC text is a serialization, the next part of the serial will be advanced as soon as the previous part is completed, even if this occurs before the end of the month. Languageseeker (talk) 19:57, 14 May 2021 (UTC)
@Languageseeker: I think it is perfectly fine to have bits of a periodical in an MC. For example, you can have a single work like Dorian or you could have just a single issue of a volume at a time. Some periodicals like Lippincott's are so big I really don't think it's likely we'd ever get a whole volume proofread, let alone validated. Definitely make clear which bits are in the MC.
I'm neutral on adding works mid-month. Bear in mind that it will cause some stats to be adjusted (for example anything using total pages).
RE removing after proofreading, I dunno really. I guess it depends on how much validation we see vs. proofreading, which will take some time to stabilise to a steady state signal. Some people prefer to validate, others to proofread, so I don't know if keeping works around to validate will take too much away from the proofreading. If we scrap them after one month, we'll basically never see a validated work. I don't personally mind (I'm a 196%er myself) but it is part of Wikisource that some people really value.
Maybe just allow them to idle in the "to be validated" section, and if no-one does them (in general, I predict they won't, but I could be wrong), they'll expire naturally. We could move the "to validate" section(s) to the bottom of the page? Remember that they'll get their own "one month old" section.
I'll have to think about the export author thing when exporting a section of a larger work.
Another edition of The Sign of the Four is already proofread from scans, so is there something we can do that would be a totally new thing? Out of all the bazillions of literary journals, there must be tons of stuff that's missing (or not scan-backed)? Inductiveloadtalk/contribs 20:22, 14 May 2021 (UTC)
@Inductiveload: I think that moving the To Validate texts to the bottom is a good idea.
Most of the popular parts of monthly magazines were reprinted; that is why we know about them. However, texts often changed between the magazine version and the published version. Proofreading the original magazine version creates a new way of experiencing an already familiar text.
There is no online edition of the Lippincott's text, so that would attract fans of Sherlock Holmes to us. It’s only 100 pages, so it’s not that much to proofread. I see it as a short and cool text to do quickly. Languageseeker (talk) 21:41, 14 May 2021 (UTC)
@Languageseeker: the exports now take the section_author field value if set. So Dorian takes "Oscar Wilde". If you were to export all of Lippincott's 46, it would not have an author set.
Fine, then add it to the nominations. I don't have a better idea for now. I don't disagree that we should have all the relevant versions, it's just I think we'd be better off having at least one version of everything than 2 versions of half as many things, especially when we know we have way less than half the things to start with. Again, whatever people want to work on.
"...so that would attract fans of Sherlock Holmes to us." Well, it would if we had, like, any messaging outside enWS :-/ Maybe we need a Tw*tterbot (insert your own vowel).
Inductiveloadtalk/contribs 16:08, 15 May 2021 (UTC)
@Inductiveload: Maybe I'll save Holmes for another time. We have too many works already.
A twitter bot would be great for Wikisource in general. The more promotion the more users we will attract.
Also, could you add a Validation = true field in the data so that we can move works such as Dorian to the To Validate section? Languageseeker (talk) 01:00, 19 May 2021 (UTC)
@Languageseeker: Done - set status = 'proofread' to override the automatic proofread detection. Inductiveloadtalk/contribs 07:30, 19 May 2021 (UTC)
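
The override described above can be pictured with a minimal Python sketch. The real data table lives in a Lua module on Wikisource and is not reproduced here; the function name, the entry dictionary and the page-count arguments below are all hypothetical, and only the status = 'proofread' field comes from this thread.

```python
def effective_status(entry, pages_total, pages_proofread, pages_validated):
    """Return the status to display for an index (hypothetical helper).

    pages_proofread counts pages that are proofread or better;
    pages_validated counts the subset of those that are validated.
    """
    # A manual override in the data table wins over automatic detection,
    # e.g. when only one section of a large periodical volume is in the MC.
    override = entry.get("status")
    if override in ("proofread", "validated"):
        return override
    # Otherwise fall back to automatic detection from the page counts.
    if pages_total > 0 and pages_validated >= pages_total:
        return "validated"
    if pages_total > 0 and pages_proofread >= pages_total:
        return "proofread"
    return "to proofread"
```

Under these assumptions, an 800-page volume with only the targeted section finished would still be listed as proofread once the override is set.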

Duplicated titles

In the works listing, I see that Heart of Darkness is under the category Novel, while this current sprint at the time of writing is Novels. This comment might be in the wrong place, however I just want to note that that's there, and that's probably an accident. Testingitro (talk) 00:35, 22 May 2021 (UTC)

Planning for June

@Inductiveload: For June, I think that we should have the following changes.

  1. Make the root page Wikisource:Community collaboration/Monthly Challenge/Current Challenge.
  2. Retire Mathnawí early due to potential copyright challenges.
  3. Retire The Atlantic Monthly (Volume 1), The Strand Magazine (Volume 1), Nature (Volume 1) early because these serials are probably better dealt with on an individual article basis.
  4. Would it be possible to create an award each month for the users who proofread/validated the highest, second-highest and third-highest number of pages?

Thoughts? Languageseeker (talk) 01:44, 27 May 2021 (UTC)

@Languageseeker: I'm not sure if the root page is a good idea to be the current month. It makes sense for getting people to the current works ASAP, but it hides the rest of the stuff. Maybe a nice big "current month" link? The bold "Monthly Challenge" link on the front page goes to the active month as it is.
I'll have a look at a way to "retire" works early. It'll have to go in the data table. We also need to have a way to record "completions" in the data table. Probably this should be manual, since it'll be fragile and awkward to work this out "online" all the time (in particular it's vulnerable to breakage down the line when the MC is long-finished and people mess with the index).
User awards are certainly possible, since the information is recorded in the DB. But, not all users appreciate being entered into league tables by default. I'm pretty sure there was once some kind of collaboration before where there was an opt-out (or maybe opt-in, and may not have been enWS) list that was used by quite a few people. Inductiveloadtalk/contribs 09:38, 28 May 2021 (UTC)
@Inductiveload: Can you set up the page for June? You know the infrastructure better than I do and I don't want to make a mess that requires untangling. I'm happy to add the volumes, but don't want to damage things.
For retiring texts early, maybe it's possible to just add a no display field so that the text can still be in the data table, but no longer visible on the page?
Good point about privacy. I always think that opt-in is better than opt-out. It's also best to make the stats as anonymous as possible, e.g. Languageseeker_total = X not Languageseeker_total_for_work_y = x. Languageseeker (talk) 00:17, 31 May 2021 (UTC)
@Languageseeker: Pages now set up. The indexes need to be added to the category for the script to pick them up. Probably need automating one day.
I have retired Mathnawí (via the last_month parameter in the data table). For the other three, do we have replacements? Or will First Folio be the only immortal work?
We also need to add new works for June to the data table, which I will do, but you should check and make adjustments if you don't agree.
We now have ~10 hours to break 3000 pages! Inductiveloadtalk/contribs 13:51, 31 May 2021 (UTC)
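
As a rough illustration of the last_month retirement mentioned above, here is a small Python sketch. Only the parameter name last_month is taken from this thread; the 'YYYY-MM' string format, the is_active helper and the assumption that last_month is the final month in which a work is still shown are guesses made for the example.

```python
def is_active(entry, month):
    """True if a work should be listed in the given 'YYYY-MM' month.

    Hypothetical helper: assumes last_month, when present, is the last
    month (inclusive) in which the work appears in the challenge.
    """
    last_month = entry.get("last_month")
    # Zero-padded 'YYYY-MM' strings sort chronologically as plain strings.
    return last_month is None or month <= last_month
```

A work with no last_month set never retires this way, which is how a "no expiry" work could be modelled.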
Great work! I tagged the new works for June. My plan is just to retire the periodicals and not replace them with anything. Maybe, we can get rid of the no expiry section and just blend Shakespeare in with the rest. It also might make sense to get rid of the two less than 50 pages section. Here’s to breaking 3,000. Languageseeker (talk) 14:07, 31 May 2021 (UTC)
For Shakespeare, that could work if we target one play at a time within the index (like we did for Dorian Gray within Lippincott's 46). In that case, we probably should keep it as no-expiry, because it'll take aaaages to chew into it. But the issue then is that the old months will show the current month's targeted section instead of what they were at the time.
At this rate it might be simpler to curate the data tables by month and manually copy rolled-over items, which means it's easier to make customisations without affecting past data tables. Not really an enormous amount more work than currently, and actually might simplify modules. Inductiveloadtalk/contribs 14:17, 31 May 2021 (UTC)
Actually, targeting a specific play might not be a bad idea and might get more done. 20-30 pages is certainly less intimidating than 900+ and should be doable in 3 months. How about retiring the First Folio and replacing it with The Tempest (First Folio)? Languageseeker (talk) 14:23, 31 May 2021 (UTC)
@Inductiveload: Also, since the FF is an image-based index, shouldn't it be possible to create an index just for a play that will also update the index for the entire play? This could make it even less intimidating. Languageseeker (talk) 14:37, 31 May 2021 (UTC)
@Languageseeker: I have changed to month-by-month data tables, so the June data is now at Module:Monthly Challenge/data/2021-06. This makes it a lot easier to have very fine grained control over what's in a month's challenge without having to special-case lots of indexes. I still need to fiddle with things like sprints, but the main listings are using the monthly data already.
For FF, I see where you're coming from, but I'm not sure breaking our "index is an edition" convention is a great idea. What we can do is split up the page list like Index:Ferrier's Works Volume 3 "Philosophical Remains" (1883 ed.).djvu. That breaks the interactive page lister but the page list is done so that's not really an issue for me. Inductiveloadtalk/contribs 15:10, 31 May 2021 (UTC)
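
The month-by-month approach discussed above (one data table per month, with rolled-over items copied forward) might look something like the following Python sketch. The actual tables are Lua modules such as Module:Monthly Challenge/data/2021-06, whose contents are not shown in this thread, so the dictionary layout, the roll_over function and the "section" field are hypothetical.

```python
import copy


def roll_over(previous_month, keep):
    """Build a new month's table from selected works of the previous month.

    previous_month maps an index name to its settings; keep lists the
    index names to carry forward. Both shapes are assumptions.
    """
    # Deep-copy each entry so that customising the new month (for example
    # targeting a different section) never alters the archived table.
    return {
        index: copy.deepcopy(previous_month[index])
        for index in keep
        if index in previous_month
    }
```

Because each month owns its copy, an old month keeps showing the section that was targeted at the time, which is the problem raised earlier in this thread.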

Shakespeare First Folio

Is there a particular reason why someone has started another index page for this when there is already a version Index:Shakespeare - First Folio Faithfully Reproduced, Methuen, 1910.djvu that is approximately 50% progressed? Perhaps it's for the same reason that incomplete scans of Paradise Lost keep being uploaded. Chrisguise (talk) 00:30, 29 May 2021 (UTC)

@Chrisguise: Yes, there were a number of reasons for doing so, with these being the principal ones:
  1. The Methuen is a valuable and important text, but it does not come from a single folio, but a combination of folios. Therefore, it's a new edition of the first folio. The differences are probably minor, but I wanted there to be a digital edition of the First Folio that can be matched to an exact physical one.
  2. The scans for West 190 are full resolution and not a compressed DJVU which makes them easier to read. Furthermore, the Methuen text is a printed engraving of a photograph making it more difficult to read than the original. This also means that the Methuen will have lower quality woodprints than the original source.
  3. The Methuen is transcribed with certain modernizations of orthography such as the replacement of the long s with a regular s. I wanted to preserve the original orthography.
Languageseeker (talk) 00:21, 31 May 2021 (UTC)

Long-term series

We probably need to think about long-term series, in that they're so enormous that they'll block out chunks of MC effort for years. For example, even if we can do a complete Carlyle every month, it'll be nearly 3 years.

I think we should make it a "thing" that if work fizzles on a series (for some informal and flexible definition of fizzle), it gets rotated at the next month and can be reinstated if people re-nominate it. Incomplete series can go on a "we tried, you wanna go?" pile for picking over and re-nomination.

This means we get rapid turnover (every month) unless people really like a series and show it by proofreading it, in which case it will stay as long as that continues. TL;DR play with it or put it back in the toybox, don't leave it on the floor. Inductiveloadtalk/contribs 18:01, 7 June 2021 (UTC)

@Inductiveload: Hmm, you're probably right. Maybe we could put them in a separate section. One for "To proofread (Long-Term Series)" and one for "To Validate (Long-Term Series)".
My thought behind these series is that these works are important enough that they should be done no matter how long it takes. I know that interest might fizzle out, but it's really hard to know when it will return. It's a sort of always there if someone wants to pick it up. What I don't want to happen is to have a huge pile of abandoned series that only have a few volumes done. Languageseeker (talk) 03:45, 8 June 2021 (UTC)
I'm trying to limit the number of spots that these series take by only allowing one in To Proofread and one in To Validate. I don't think that one will be added each month unless there is significant community interest. Languageseeker (talk) 03:45, 8 June 2021 (UTC)
@Languageseeker: I think leaving "fizzled" series on a priority pile for easy pick-up would be OK. So resumption of a previous series is preferred to a whole new series (assuming the same level of interest). I'm just wary of blocking out slots for months and months until we can refresh. I'd rather start a new series than keep a stalled one blocking the slots (but I'd rather finish an active series than start a new one).
Validation is easy to deal with: the volumes get three months (from the start) and then they drop off naturally as normal. If people want to validate any past MC work that has been proofread, it can come back at any nomination. Without auto expiry, we'll end up buried in yellow works and spread validation thinner and thinner, so hardly anything will ever get done. We'll have to accept that many works will expire either un-proofread or un-validated, but make it easy to bring them back if interest sparks. Inductiveloadtalk/contribs 08:22, 8 June 2021 (UTC)
@Inductiveload: I see your point. Instead of taking a slot forever, maybe we can make a list of unfinished series. Let's treat long-term series as normal works and give them three months. However, I still don't think that we should have more than one volume from a series in proofread or validation. Would it be OK for the previous volume to be removed from the list if it isn't validated by the time the next volume is finished? Languageseeker (talk) 04:22, 9 June 2021 (UTC)
Yeah, we should have a list, and reinstatement from that list should be preferred over new series if there is appetite (but better to have an active work than not).
I don't think we should remove from the "to be validated" section early, even if new volumes get proofread. Validation is usually slower than proofreading, so that would pretty much ensure it was impossible to validate anything. After 3 months, we'll likely reach a steady state of the inflow of proofread volumes and the outflow of validated or expired volumes (I reckon we'll probably have ~2-4 volumes in the validation queue, mostly depending on the rate of proofreading). Or the series stalls.
Also, the validation queue is segregated by age, so probably the series will be split across the three sections (and the section is at the end). Inductiveloadtalk/contribs 14:48, 10 June 2021 (UTC)
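
The three-month expiry and the age-segregated validation queue described above can be sketched in Python as follows. This is only an illustration under stated assumptions: months are 'YYYY-MM' strings, the starting month counts as age 0, and a work is dropped once its age reaches three months. The function names are hypothetical and the real module may count differently.

```python
def months_between(start, current):
    """Number of whole calendar months from start to current ('YYYY-MM')."""
    start_year, start_month = map(int, start.split("-"))
    cur_year, cur_month = map(int, current.split("-"))
    return (cur_year - start_year) * 12 + (cur_month - start_month)


def validation_section(start, current, max_months=3):
    """Return the age bucket (0, 1, 2) for a work, or None if not listed.

    Assumes a work is shown for max_months calendar months counted from
    its starting month, then expires naturally.
    """
    age = months_between(start, current)
    if age < 0 or age >= max_months:
        return None
    return age
```

With these assumptions, a volume added in 2021-05 appears in the queue for May, June and July, and has dropped off by August unless it is re-nominated.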

Redundant works contained in series

Also, we should probably think about checking that volumes in the MC aren't covering material which we already have. For example, HG Wells v2 is Moreau and Sleeper, but we have complete, scan-backed versions of both of those. Same for v3. Whereas v4 is Anticipations, which we don't have a copy of at all. I'm not against filling in the series, but I'd rather focus efforts on things we do not have scan-backed versions of. After all, one of the stated goals of the MC is core text building. Inductiveloadtalk/contribs 03:21, 11 June 2021 (UTC)

I think that it might feel as if we're replicating works, but these are not exactly the same. For me the long-term series have to be more than a series of mere reprints. Instead, they have to contain a contribution from the author, an illustrator, or an editor that makes them a critical edition. For example, for the Atlantic Edition, H.G. Wells revised all of the material. Most of them have never been reprinted because they were in copyright. They are unknown versions of very well-known stories. The rich paid to have slightly better versions of these texts and Wikisource is making them available to everyone.
I'm also wary of breaking up a series. At which point do we stop? Should we skip only volumes or also remove parts of volumes? Realistically, nearly every long-term series will contain some duplicate titles. I'd rather proofread a few extra volumes than have a bunch of unfinished series. Languageseeker (talk) 12:38, 11 June 2021 (UTC)