Gustavo Lacerda

Tuesday, May 3rd, 2005
5:54p - project idea: create an algorithm to identify an author's native language in English text
It shouldn't be too hard to create an algorithm to identify an author's native language in English text.

At least for authors with a lower level of English, it should be very easy to spot signature mistakes. For example, if a question begins with "what for [noun] ...", then the author's native language is very probably Dutch or German: it's a literal translation of the Dutch "wat voor [noun] ..." (German "was für [noun] ..."), the usual way of saying "what kind of [noun] ...".
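To make the "signature mistake" idea concrete, here is a minimal sketch of a hand-written rule for this one pattern. The function name and the regular expression are my own assumptions, and the rule does not actually check that the word after "what for" is a noun, so treat it as a rough heuristic rather than a real detector.

```python
import re

# Hypothetical signature pattern: a sentence that opens with "what for <word>".
# The bare question "What for?" is ordinary English, so a following word is
# required. No part-of-speech check is done; any word after "for" matches.
WHAT_FOR = re.compile(r"^\s*what\s+for\s+(?:an?\s+)?[a-z]+", re.IGNORECASE)

def looks_dutch_or_german(sentence: str) -> bool:
    """Return True if the sentence begins with the 'what for [noun]' calque."""
    return WHAT_FOR.search(sentence) is not None
```

A real system would need many such rules, one per signature mistake, which is why learning them from labelled text is the more interesting option.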

I wonder if a generic machine-learning technique would discover this pattern by itself when fed a corpus of texts labelled with the author's native language.
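As a rough illustration of what "generic" could mean here, the sketch below trains a naive Bayes classifier on word bigrams, with no built-in knowledge of the "what for" rule. This is just one possible technique, chosen for brevity. The four-sentence corpus is invented for the example, and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def bigrams(text):
    """Lowercased word bigrams, with a start marker so sentence openings count."""
    words = ["<s>"] + re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))

def train(labelled_texts):
    """Count bigram occurrences per native-language label."""
    counts = defaultdict(Counter)
    for text, label in labelled_texts:
        counts[label].update(bigrams(text))
    return counts

def classify(counts, text):
    """Pick the label with the highest add-one-smoothed log-likelihood."""
    vocab = {bg for c in counts.values() for bg in c}

    def score(label):
        c = counts[label]
        # One extra slot in the denominator stands in for unseen bigrams.
        total = sum(c.values()) + len(vocab) + 1
        return sum(math.log((c[bg] + 1) / total) for bg in bigrams(text))

    return max(counts, key=score)

# Invented toy corpus, for illustration only (not real data).
corpus = [
    ("What for music do you like?", "Dutch"),
    ("What for a car is that?", "Dutch"),
    ("What kind of music do you like?", "English"),
    ("What kind of car is that?", "English"),
]
model = train(corpus)
guess = classify(model, "What for books do you read?")
```

On this toy data the bigram ("what", "for") is the feature that tips the decision, so in a very small way the classifier does pick up the pattern from the labels alone. Whether that holds on a real corpus is exactly the open question.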

It should be easier than identifying the author's gender, in any case. Apparently, no one claims to guess the author's gender with more than 80% accuracy. I find this unsatisfactory.

LiveJournal.com