User:Struthious Bandersnatch/Categorization and URL Hierarchies on Wikisource

From Wikisource
Referencing a December 3, 2007 Scriptorium discussion and its follow-up in May 2008.

Categorization and URL Hierarchies on Wikisource

Okay, well I've read through the December conversation on this (thanks for the link, John) and I've looked at some of these pages that Eclecticology has created and I have formulated an opinion. Now everyone listen up, because my opinions are super-duper extra-special important since I wear Argyle socks.
First a bit more background on me and my experience with similar issues. My primary RL career is as a software engineer working specifically with web-based file management and content applications. Many of the rest of you obviously have similar experience. Most of my work experience is with applications written for and hosted on Windows servers, following party-line Microsoft development processes and techniques, but I've done a great deal of cross-platform integration work and privately use a lot of Linux and do lots of PHP programming. I've also done a great deal of specialized XML work, things like XML-to-relational-database mapping - it's all about the hierarchies. Also my academic background was in mathematics with some special focus on graph theory.
One of Eclecticology's justifications for these various topical pages, that there are multiple organizational hierarchies to be dealt with here, is something which is a familiar issue for me. Often, with any sort of content, there are multiple conflicting hierarchies which a particular content item must be located within and which the overall repository of content must be organized with respect to.
On the technical side of things it's often an easy or elegant solution to have one special hierarchy that overrides the others. Having a special hierarchy often simplifies some things but causes problems and inflexibilities in the long run. An example would be the single-inheritance class polymorphism model of Java and the Microsoft .NET programming languages. Another example would be the hierarchy of tags of an XML document, which is so embedded and given special status that compliant handling of XML forces the format of an "XML document" (basically a single root tag that wraps around everything else), causing the need for all sorts of jury-rigged "XML fragment" handling mechanisms, both proprietary and standards-based.
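To make the XML point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser. The helper name `parses_as_document` and the sample strings are made up for illustration; the only thing being shown is that a run of sibling tags with no single root is rejected as a document, and that wrapping it in an artificial root (one common "fragment" workaround) makes it acceptable.

```python
import xml.etree.ElementTree as ET


def parses_as_document(text: str) -> bool:
    """Return True if `text` is accepted as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample data: two sibling elements with no enclosing root.
fragment = "<p>one</p><p>two</p>"

# A common workaround: wrap the fragment in an artificial root element.
wrapped = f"<root>{fragment}</root>"

print(parses_as_document(fragment))  # the bare fragment is rejected
print(parses_as_document(wrapped))   # the wrapped version is accepted
```

The wrapper element carries no meaning of its own; it exists only to satisfy the special status of the tag hierarchy.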
John's proposal is that the hierarchy created by the slash-delimiter of the URLs should be regarded as such a special hierarchy. This view has merits; following it would conform even more tightly to Jakob Nielsen's URL usability guidelines, which MediaWiki already follows pretty well. And as he points out it also provides a neat and usable definition of what a "work" is, one that could be easily parsed and utilized by a bot.
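As a rough sketch of how little a bot would need under such a URL-based definition, the following groups page titles by everything before the first slash. The function names and the example titles are assumptions made up for illustration, not anything taken from the proposal or from an existing bot.

```python
from collections import defaultdict


def work_root(title: str) -> str:
    """Return the part of a page title before the first '/'.

    Under the slash-delimiter convention this is taken to name the work.
    """
    return title.split("/", 1)[0]


def group_by_work(titles):
    """Group page titles under their root title, preserving input order."""
    works = defaultdict(list)
    for title in titles:
        works[work_root(title)].append(title)
    return dict(works)


# Hypothetical page titles, for illustration only.
pages = ["Moby-Dick", "Moby-Dick/Chapter 1", "Walden/Economy"]
grouped = group_by_work(pages)
```

The same simplicity is also the limitation: a title that happens to contain a slash for some other reason (say, a hypothetical "9/11 Commission Report") would be filed under "9", so the rule only works if every page on the site obeys the convention.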
One point that John offers is that the slash-delimiter hierarchy, as being congruent to the Unix file system, is an extremely broad-based convention both on the Internet and within computing in general. But this point I object to.
IMHO Unix actually goes waaaaay overboard with the filesystem metaphor. In that OS family hardware devices appear as files within the filesystem, which caused a common security issue early on, only really rectified in the early nineties: any user could view the device file for a TTY terminal using a text reader and watch things like, oh, a sysadmin typing his password as he logged on (in this case the filesystem metaphor was so dominant that simple file permission defaults were enough to allow any process arbitrary, direct access to hardware). Multi-threading issues in software are often resolved with a lock file rather than a lock object or register in memory, causing interesting (and often spectacularly catastrophic) problems if a drive is unexpectedly unmounted. Security is so tightly welded to the filesystem that it causes abstraction problems; a major challenge in porting the Apache web server to Windows early on was dealing with the fact that the Windows security system, well-tested or not, is much more policy-oriented and application-domain-oriented in design and much less filesystem-oriented than Unix/Linux.
This actually has caused me practical issues in RL work situations; some engineers coming from the Unix/Linux world have serious difficulty dealing conceptually with web applications that create a web site which is not modeled on a filesystem with a bunch of flat HTML files in it, but which, for example, takes its navigational structure from an XML file and doesn't care what its URLs look like. Some regard that as just plain wrong, as though some inviolate Tao of computing is being transgressed, and counterproductively will try to force cosmetic aspects of software to emulate a filesystem for no useful reason.
I don't think MediaWiki is too bad on this count but I have been slightly annoyed by a couple of needlessly URL-oriented aspects of the system such as the "namespace" concept, which has nothing to do with and does not resemble XML namespaces or programming language namespaces and I suspect probably is just an artifact of the way the database is structured (a guess based on the way that namespaces are used to scope searching).
While I respect John and Pathoschild's concerns, both technical and aesthetic, I do not believe that those concerns trump the value of what Eclecticology is doing. I think we should let Eclecticology continue unopposed and should formulate a way to delineate what a single work is that is more flexible than the proposed URL-based definition. Even if such a policy increases complexity on the technical side I think we will benefit in the long run from not tying everything to the URL. I don't in general oppose the idea of creating and enforcing some standard for defining what constitutes a work, nor do I oppose mandating measures for authors to take that would facilitate bot operations, but mandating URL patterns for these purposes is the wrong way to go.
So that's what I think. --❨Ṩtruthious ℬandersnatch❩ 19:56, 22 May 2008 (UTC)