Friday, May 21, 2004

Intermediate Observations...

Hmmm...

At first blush it seemed hopeless. However, it is interesting to note that the results grow somewhat stronger as the vocabulary and the repertoire of sentence structures get deeper and broader...

It still remains the case, however, that sensibility eludes us. Take a newly considered case... the preposition "to". This is an obviously handy word. It has two primary functions in English - prepositional and as an infinitive marker. As a preposition it generally expresses directionality or orientation: "give that to me." As an infinitive marker, though, it is a critical component of the verbal infinitive...

Today I've developed a 4,700-word vocabulary. The word "to" has the grammeme "000010010000000000", and the set of all words sharing that grammeme is "about, along, through, upon, forth."
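
Just to pin down what that means, here's a rough sketch of the grammeme lookup (TypeScript used purely as sketch notation, not Seductotron's actual code; the lexicon layout and the "give" signature are made up for the example):

```typescript
// Hypothetical lexicon: each word maps to its 18-bit part-of-speech
// signature ("grammeme"). The bit layout here is invented.
const lexicon: Record<string, string> = {
  to: "000010010000000000",
  about: "000010010000000000",
  along: "000010010000000000",
  through: "000010010000000000",
  upon: "000010010000000000",
  forth: "000010010000000000",
  give: "010000000000000000", // invented verb signature
};

// Words sharing a grammeme are indistinguishable to the system.
function wordsWithGrammeme(grammeme: string): string[] {
  return Object.keys(lexicon).filter((w) => lexicon[w] === grammeme);
}

console.log(wordsWithGrammeme(lexicon["to"]));
// -> ["to", "about", "along", "through", "upon", "forth"]
```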

Now, today I incorporated polysemy information into the database. I think I've used that word right. Anyhow, I'm now tracking how often each word turns up in each of its usages.

So, one possible approach to the case of a word like "to" is to let the collection of frequency data run its course. I would imagine that with the passage of time the word "to" will gradually eclipse the other words in the set. The system would occasionally express an infinitive construction as "forth [verb]" but would lean strongly towards the construction "to [verb]". Of course, it would also favor "[verb] to" over "[verb] forth"...
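
In sketch form, the bookkeeping might amount to nothing more than this (TypeScript as notation again; isVerb() and the counting scheme are placeholders, not the real thing):

```typescript
// Count how often each word in the grammeme set appears immediately
// before a verb. Given enough text, "to" should dwarf its siblings.
const preVerbCounts = new Map<string, number>();

function observe(tokens: string[], isVerb: (w: string) => boolean): void {
  for (let i = 0; i < tokens.length - 1; i++) {
    if (isVerb(tokens[i + 1])) {
      const w = tokens[i].toLowerCase();
      preVerbCounts.set(w, (preVerbCounts.get(w) ?? 0) + 1);
    }
  }
}

// When generating "[marker] [verb]", pick the most frequent member
// of the set (candidates is assumed non-empty).
function preferredMarker(candidates: string[]): string {
  return candidates.reduce((best, w) =>
    (preVerbCounts.get(w) ?? 0) > (preVerbCounts.get(best) ?? 0) ? w : best
  );
}
```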

So, that's the cheap and unpleasant approach. It could almost be made sustainable by restricting the language-structure input to formal writing. Chatboard writing is inevitably going to burden Seductotron with terrible syntax. Formal writing might at least give him a fighting chance - no strings of nouns undelimited by commas, no thoughts trailing off into space...

A more active approach, however, is to abandon the tabula rasa approach and adopt some kind of programmatic sentence analysis...

As I'm tentatively envisioning it, the idea would be to identify the critical building blocks of a sentence, make reasonable guesses (based on the grammeme) at which word is operating in which capacity in any given situation, and then label words according to their function rather than exclusively by their "part-of-speech signature"... then somehow cross-index each word's definitional status with its usage info...
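
To make that a little less hand-wavy, here's one guess at the shape of it (TypeScript sketch only; the bit positions, role names, and rules are all invented for illustration):

```typescript
// Guess each word's function from its grammeme plus its neighbor's,
// then tally the guesses alongside the part-of-speech signature.
type Role = "infinitive-marker" | "preposition" | "verb" | "unknown";

interface Entry {
  grammeme: string;
  roleCounts: Map<Role, number>;
}

const PREP_ADV = "000010010000000000"; // the "to"-like signature
const VERB_BIT = 1;                    // invented bit position for "verb"

function guessRole(grammeme: string, next?: string): Role {
  if (grammeme === PREP_ADV) {
    // "to [verb]" reads as an infinitive; "to me" as a preposition.
    return next?.charAt(VERB_BIT) === "1" ? "infinitive-marker" : "preposition";
  }
  if (grammeme.charAt(VERB_BIT) === "1") return "verb";
  return "unknown";
}

function analyze(
  sentence: { word: string; grammeme: string }[],
  db: Map<string, Entry>
): void {
  sentence.forEach((tok, i) => {
    const role = guessRole(tok.grammeme, sentence[i + 1]?.grammeme);
    const entry =
      db.get(tok.word) ?? { grammeme: tok.grammeme, roleCounts: new Map() };
    entry.roleCounts.set(role, (entry.roleCounts.get(role) ?? 0) + 1);
    db.set(tok.word, entry);
  });
}
```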

This would have the advantage of building upon the existing system (provided it works) and would save me another complete re-write (ironically, the most "successful" versions appear to have been the earliest and "dumbest" ones). The disadvantage, of course, is that if the entire approach is a dead end, I will have progressed further along it to no measurable benefit.

Also, before moving on to further elaborations and steps, I need to spend some time sorting out what I already DO have.

Specifically, the translation of sentences into grammeme/punctuation strings needs to be improved. Overall the system deals quite crappily with most forms of punctuation. The "ScrubWord()" function is a piece of legacy functionality from an earlier version, and it needs to be revised to accommodate the flexibilities and limitations of this new system.
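
For the record, what I'm imagining the revised version doing is roughly this (a TypeScript sketch, not the actual function): instead of throwing punctuation away, peel it off into tokens of its own so it survives into the grammeme/punctuation string.

```typescript
// Split a raw token into word + punctuation tokens, so "me," becomes
// ["me", ","] and "(hello!)" becomes ["(", "hello", "!", ")"].
function scrubWord(raw: string): string[] {
  const m = raw.match(/^([^A-Za-z0-9']*)([A-Za-z0-9'-]*)([^A-Za-z0-9']*)$/);
  if (!m) return [raw];
  const [, lead, core, trail] = m;
  return [...lead, ...(core ? [core.toLowerCase()] : []), ...trail];
}

console.log("Give that to me, please!".split(/\s+/).flatMap(scrubWord));
// -> ["give", "that", "to", "me", ",", "please", "!"]
```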

Also, it will probably be good to use the .innerText property from now on. I think we could use this to turn Seductotron into a general-purpose reader (i.e., a program that can read without regard to a site's HTML markup) if we allow him to disregard all text that ends with a line break before it ends with punctuation. That would probably bring him really close to reading ANY site without prior knowledge of its structure. Then we would only need to learn the structure of the sites with which we actually interact (possibly this blog, email, Slate's Fray, Instant Messenger, etc...).
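
The filter could be as simple as this (browser-side TypeScript, sketching the heuristic rather than real Seductotron code):

```typescript
// Take the page's visible text and keep only the chunks that end in
// sentence punctuation; menus and navigation links tend to end at a
// bare line break instead, so they fall away.
function readableText(el: HTMLElement): string[] {
  return el.innerText
    .split(/\r?\n+/)
    .map((line) => line.trim())
    .filter((line) => /[.!?]["')\]]?$/.test(line));
}

// e.g., in a browser console:
// readableText(document.body).forEach((p) => console.log(p));
```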
