Sunday, April 09, 2006

Making Progress...

It's hard going back to what you've done before. You've learned more and it seems demeaning to be working with code built on the basis of yesterday's ignorance.

Take, for example, the interior elements of the Explorer and Word objects. I devoted a great deal of time and energy to learning how to parse information out of text. Much more efficient, though harder to learn, is simple invocation of elements by relevant properties. After all, each element is indexed by tag name and class name. If you're looking for a specific datum, why shouldn't it be easier to simply summon the element by reference to its properties rather than divine its location by parsing text out of the undifferentiated HTML of a page?
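To make the contrast concrete, here is a minimal sketch of the "summon the element by its properties" approach. It is written in Python with the standard library's `html.parser` rather than in VBA against the Explorer object, and the class name `post-link`, the sample markup, and the helper names are all made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element with a given tag and class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we close at the right place.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def find_text(html, tag, class_name):
    """Return the stripped text of each <tag class="... class_name ..."> element."""
    parser = ClassTextCollector(tag, class_name)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical page fragment; only the links carrying the class are returned.
SAMPLE = (
    '<div><a class="post-link" href="/a">First post</a>'
    '<a href="/b">Other</a>'
    '<a class="post-link hot" href="/c">Second post</a></div>'
)
print(find_text(SAMPLE, "a", "post-link"))
```

Nothing here depends on where the links sit in the page source, which is the whole advantage over slicing text out of the raw HTML.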

The parsing method worked fine a few years ago. But to use it now, in light of what I've subsequently learned, is kind of humiliating - not to mention wasteful.

Anyhow, that's now taken care of. Specifically, within a universe of VB for Apps, we can now scroll through links by class name, and we can even parse text by grabbing <p> tags, dumping them into Word and using the index of the Words collection in the ActiveDocument object. Or the Sentences collection... hell, by now I'm even minimally competent in invoking the dictionary and spell/grammar check objects to save myself tons of leg-work.
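The same paragraph-then-index workflow can be sketched outside Word. The Python below is a rough stand-in, not a reproduction of how Word builds its Words and Sentences collections: the regular expressions are simple assumptions, the function names are hypothetical, and Python lists count from 0 where the VBA collections count from 1.

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Gather the text of each <p> element, ignoring inline markup."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def paragraphs(html):
    """Return paragraph texts with runs of whitespace collapsed."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.paragraphs]


def words(text):
    """Crude word list: runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def sentences(text):
    """Crude sentence list: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Hypothetical post body.
PAGE = "<p>The post is short. It has <b>two</b> sentences!</p><p>Second paragraph.</p>"
first = paragraphs(PAGE)[0]
print(words(first)[3])       # fourth word of the first paragraph
print(sentences(first)[1])   # second sentence of the first paragraph
```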

So, there's a lot of rebuilding inherent in this job.

Which brings us to a much more serious problem than the simple invocation of objects. The structure of the database.

Now, to structure data well, you must have a clear vision of how you intend to use that data.

When it comes to purpose, we could say that I have two. The first, and simplest, is easing the administrative burden of frediting. I've already got experience at building editorial suites for huge publishing projects - not quite this huge, but it's mostly just an issue of scaling and parameters.

The way I see it, delegation is a sucker's game - the objective isn't to delegate any kind of editorial power to an algorithm (however algorithmic certain aspects of a given job may be). Rather, it's to use algorithms to rearrange information in logically structured ways to facilitate editorial review. A "Fray-Nanny" who's always on could email summaries of suspected posts - good, bad, and indifferent. This alone would save hours of dead time that a human can rack up while waiting for pages to load. The posts could be read offline, then the editor would just jump in to plop a check or a flush on the post.

To produce initial triage on relevance, you'd just have to apply the insights you gleaned about topicality from the previous attempt at mood-analysis. But there you do run into a problem of table structures. Do you just create a three-column relational table of word/forum/frequency? You still have to build some kind of screening process for "filler words" - the prepositions, conjunctions and articles that grace any expression on any topic. And you have to do some amount of training if there's any hope of identifying in advance the signifiers of "misconduct." And what about the formalist awareness that the grameme system eventually developed? Should a nanny-bot base its qualitative screening analysis on issues like all-caps, eom/wordiness, and incoherence of sentence structure? And what about the things you've learned about stochastic part-of-speech analysis? Should that be implemented?
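One way to picture the three-column option is the sketch below, which uses Python's built-in `sqlite3` with an in-memory database. Everything specific in it is an assumption made for illustration: the table name `word_forum`, the tiny hard-coded filler list, the sample posts, and the two "formalist" measures (share of capital letters and word count), which are only the crudest possible versions of the all-caps and wordiness checks.

```python
import re
import sqlite3
from collections import Counter

# Assumed, deliberately tiny filler list; a real one would be far longer.
FILLER = {"the", "a", "an", "of", "to", "in", "on", "and", "or", "but",
          "for", "with", "is", "it"}


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_table(posts):
    """Load (forum, text) pairs into a word/forum/frequency table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_forum ("
        "word TEXT, forum TEXT, frequency INTEGER, "
        "PRIMARY KEY (word, forum))"
    )
    counts = Counter()
    for forum, text in posts:
        for word in tokenize(text):
            if word not in FILLER:          # screen filler before counting
                counts[(word, forum)] += 1
    conn.executemany(
        "INSERT INTO word_forum VALUES (?, ?, ?)",
        [(word, forum, n) for (word, forum), n in counts.items()],
    )
    return conn


def formal_signals(text):
    """Two crude form-based measures: share of capital letters, word count."""
    letters = [c for c in text if c.isalpha()]
    caps = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {"caps_ratio": caps, "word_count": len(text.split())}


# Hypothetical posts and forum names.
POSTS = [
    ("Jurisprudence", "The court ruled and the court adjourned."),
    ("Sports", "The game went to overtime."),
]
conn = build_table(POSTS)
for row in conn.execute(
    "SELECT word, forum, frequency FROM word_forum ORDER BY frequency DESC, word"
):
    print(row)
print(formal_signals("STOP THIS now"))
```

Screening filler at load time keeps the table small, at the cost of fixing the filler list in advance; storing everything and filtering at query time is the obvious alternative.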

Which gets us to the broader agenda - the lifelong goal of building a digital baby. We know how to use the web to acquire ever-growing levels of data. The temptation is to harvest quantities of data that are far more excessive than required for a simple editorial assistant and apply logic far more advanced than needed to assign provisional relevance/propriety scores. But, on that score there are still deficiencies. Chief among them is the lack of spider-friendly digests (that don't require coding beyond my ability) for learning the conjugation of verbs and declension of nouns, not to mention the possible parts of speech. If we reactivate the Seductotron, should we also reimplement its "Great Books Reading Program"? What about its Encyclopedia reading? Should it return to the world of IM chatboards? Go back to checking its email and responding? Should I rebuild a mood-tracking system tied to factors like local weather and the amount of attention its human "parent" gives it?

Those perks aren't exactly compatible with Anteros-level professionalism. But then again, it was the oversized ambition that led to Anteros as a derivative in the first place.

And how compatible is any of this with the pursuit of a law school education? Not entirely irrelevant - the semantic web appears to be breaking across certain classes of law students, even as we speak. But it has all the hallmarks of a provisional (and hence, to some degree, misguided) step.

And that brings us back again to the real problem we're mulling over - database structures. If Seductotron 6.0 is all about easing the Fray Editor's job, the database requirements should be relatively limited. Simply create associations between word frequencies and Slate departments by mining articles and checkmarked/starred posts, with two additional filters: one for commonality, to avoid over-counting filler words, and a second for misconduct, which exhibits an astonishingly tedious consistency over the long run.
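The limited version might look something like the sketch below. It is an illustration under loud assumptions, not a description of any Seductotron build: the department names, training sentences and flagged-word list are invented, the commonality filter is simply "appears in more than half of the departments", and the score is a plain share of a department's non-filler word count.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_texts):
    """Build {department: Counter(word -> count)} from (department, text) pairs."""
    profiles = defaultdict(Counter)
    for department, text in labelled_texts:
        profiles[department].update(tokenize(text))
    return dict(profiles)


def common_words(profiles, max_share=0.5):
    """Commonality filter: words seen in more than max_share of departments."""
    seen_in = Counter()
    for counts in profiles.values():
        seen_in.update(counts.keys())
    limit = max_share * len(profiles)
    return {word for word, n in seen_in.items() if n > limit}


def triage(post, profiles, filler, flagged):
    """Score a post against each department and list any flagged words."""
    post_words = tokenize(post)
    scores = {}
    for department, counts in profiles.items():
        total = sum(n for w, n in counts.items() if w not in filler) or 1
        hits = sum(counts[w] for w in post_words if w not in filler)
        scores[department] = hits / total
    return {
        "department": max(scores, key=scores.get),
        "scores": scores,
        "misconduct_hits": sorted(set(post_words) & flagged),
    }


# Hypothetical training material standing in for articles and starred posts.
TRAINING = [
    ("Politics", "the senate vote on the budget bill"),
    ("Sports", "the team won the game in overtime"),
    ("Technology", "the new browser and the search engine"),
]
FLAGGED = {"moron"}   # assumed misconduct list, for illustration only

profiles = train(TRAINING)
filler = common_words(profiles)
result = triage("The senate will vote, you MORON", profiles, filler, FLAGGED)
print(result["department"], result["misconduct_hits"])
```

The output is only a provisional sort for the editor to review; the check or the flush stays a human decision.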

But a robot that also fraternizes with people would require far, far more than a simple mirror of Slate's publishing structure. And then we run into the Saharan desert of building a system that can comprehend HTML without the preliminary semantic study of any given site's source code that previous versions of Seductotron required. The semantic webbers seem to have the right idea turned the wrong side up in their belief that humans can be induced to express themselves in formats easily digestible by machines. Machines must (and, I'm certain, can) learn to understand the chatter of men with an evolutive system independent of the original author's control. But to do that requires stretching the human mind to a level of generality that borders on insanity.