reckless intuitions of an epistemic hygienist ([info]gustavolacerda) wrote,
@ 2005-05-03 17:54:00
Entry tags:project_ideas

project idea: create an algorithm to identify an author's native language in English text
It shouldn't be too hard to create an algorithm to identify an author's native language in English text.

At least for those with a lower English level, it should be very easy to spot signature mistakes. For example, if a question begins with "what for [noun] ...", then the author's native language is very probably Dutch or German. (It's a literal translation of "wat voor [noun] ...", the Dutch way of saying "what kind of [noun] ..."; German has the parallel "was für [noun] ...".)
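A hand-written version of such a rule might look like the sketch below. Only the "what for [noun]" pattern comes from the post; the names SIGNATURE_RULES and guess_native_languages are made up for illustration, and the "[noun]" slot is crudely approximated by "any following word", so it will also fire on sentences where "what for" is followed by something other than a noun.

```python
import re

# Hypothetical rule table: each pattern maps to the native languages it hints at.
# Only the "what for [noun]" rule is taken from the post; "[noun]" is
# approximated by "any word", optionally preceded by "a"/"an".
SIGNATURE_RULES = [
    (
        re.compile(r"^\s*what\s+for\s+(?:an?\s+)?\w+", re.IGNORECASE),
        {"Dutch", "German"},
    ),
]


def guess_native_languages(sentence):
    """Return the set of languages hinted at by any matching rule."""
    hints = set()
    for pattern, languages in SIGNATURE_RULES:
        if pattern.search(sentence):
            hints |= languages
    return hints


print(guess_native_languages("What for car do you drive?"))
print(guess_native_languages("What kind of car do you drive?"))
```

The first call returns the Dutch/German hint set and the second returns an empty set. A real system would need many such rules, plus some way of weighing conflicting hints.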

I wonder if a generic machine-learning technique would discover this pattern when fed a corpus of texts labelled with each author's native language.
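As a rough illustration of what "generic" could mean here, the sketch below trains a naive Bayes classifier on word unigrams and bigrams. Naive Bayes is my choice for the example, not something the post proposes, and TOY_CORPUS is four invented sentences rather than real learner data, so this only shows the mechanics: the bigram "what for" ends up as evidence for the "Dutch" label without anyone writing the rule by hand.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Lowercased word unigrams plus word bigrams."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.doc_counts = Counter()                 # label -> number of texts
        self.vocab = set()

    def train(self, labelled_texts):
        for text, label in labelled_texts:
            self.doc_counts[label] += 1
            for f in features(text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for f in features(text):
                if f in self.vocab:  # features never seen in training are skipped
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data, for illustration only (not a real corpus).
TOY_CORPUS = [
    ("what for car do you drive", "Dutch"),
    ("what for music do you like", "Dutch"),
    ("what kind of car do you drive", "English"),
    ("what kind of music do you like", "English"),
]

model = NaiveBayes()
model.train(TOY_CORPUS)
print(model.predict("what for book do you read"))
print(model.predict("what kind of book do you read"))
```

On this toy data the first sentence is labelled "Dutch" and the second "English". Whether the same thing happens on a real labelled corpus, with far noisier text, is exactly the open question.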

It should be easier than identifying the author's gender, in any case. Apparently, no one claims to guess the author's gender with more than 80% accuracy. I find this unsatisfactory.




[info]bondage_and_tea
2005-05-03 04:40 pm UTC (link)
Here are some tests of your intuitions -- what country did the speakers of this utterance come from?

"This allows to improve the efficiency"


[info]gustavolacerda
2005-05-03 04:45 pm UTC (link)
"allows to improve"

A lot of languages, really. I think English is exceptional in requiring an explicit subject. AFAIK, it could be Portuguese, French or Dutch... i.e. I can't rule out any languages other than English.


[info]bondage_and_tea
2005-05-03 04:46 pm UTC (link)
Interesting.


[info]xach
2005-05-03 08:02 pm UTC (link)
I have a doubt...what means VAR NOT BOUND in LISP????


[info]gustavolacerda
2005-05-03 08:15 pm UTC (link)
"what means ... ?"

Again, could be many languages... English is also unique in using auxiliaries in virtually every question.

All I'll say is that most Dutch people speak better English than that.

Btw, I don't have the data or the knowledge to make these judgements in general... I'm only good enough in 4 languages.


[info]xach
2005-05-03 08:17 pm UTC (link)
How about "I have a doubt"? I see that error quite often.


[info]gustavolacerda
2005-05-03 08:24 pm UTC (link)
hm... I wouldn't call it an error myself. [?] rather than [*].

I can tell you that it corresponds to a frequent expression in Portuguese. For some reason, it's used as often as, or more often than, "I have a question".

It should be possible to ask Google if a literal translation to French or Spanish occurs proportionately more frequently than in English.


[info]gustavolacerda
2005-05-03 08:17 pm UTC (link)
Ok. "to be" isn't always an auxiliary, but virtually every question in English has either "to be", "to do" or "to have".


[info]gustavolacerda
2005-05-04 05:21 pm UTC (link)
or shall/should or will/would


[info]gustavolacerda
2005-05-04 05:22 pm UTC (link)
or "can"


(Anonymous)
2005-05-04 05:10 pm UTC (link)
Non-nativity does not just make itself manifest in mistakes. Non-native authors often have a better or more stilted use of English grammar than native authors because the former learned the grammar by rules, and the latter by use.
--Sebastian


[info]gustavolacerda
2005-05-04 05:20 pm UTC (link)
Sure.

Usage can also indicate a foreigner. For example, Dutch people often say "is it high time for ..." which is unusual but not incorrect English (whereas it is usual and correct in Dutch). It's very hard to break away from such patterns... especially when there is no equivalent English expression.

I tend to cringe when I hear the colloquial "dit keer"... it's hard enough to tell de-words from het-words without exceptions to the rule.
