#Corpus people: an engineer/programmer friend has a project coming up for a class, and they asked if there’s anything they could help with for my diss since they have no good project ideas. They could program something to help with my corpus stuff, turn that in, maybe get an article with me out of it.
Any… any good ideas on what could be generally useful for (ES) #corpora? I can try and come up with something just for me, but it’d be cool if it’s useful for the field at large too.
The project is apparently specifically on large data management, so corpus seems like a great topic, but… what do we do. Help.
@queerterpreter A universal annotation converter?
@grvsmth I've only used one program, so I'm not sure how much conversion is needed to switch from one to another. Wouldn't it be a fairly straightforward find-and-replace regex? Like how translation tools will sometimes use <1>, or {1}, or a couple of things like that.
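For the simple case described here, a find-and-replace conversion might look something like the sketch below. It assumes only the two placeholder styles mentioned above, `<1>` and `{1}`; the `TAG_STYLES` table and the `convert_tags` function are hypothetical names made up for illustration, not part of any existing tool.

```python
import re

# Hypothetical table of placeholder styles. The two entries mirror the
# <1> and {1} examples above, not any specific tool's documented format.
# Each entry pairs a pattern that finds a tag with a template that writes one.
TAG_STYLES = {
    "angle": (re.compile(r"<(\d+)>"), "<{}>"),
    "curly": (re.compile(r"\{(\d+)\}"), "{{{}}}"),
}


def convert_tags(text, source, target):
    """Rewrite numbered placeholders from one style to another."""
    pattern, _ = TAG_STYLES[source]
    _, template = TAG_STYLES[target]
    # Keep the tag number, swap only the surrounding delimiters.
    return pattern.sub(lambda m: template.format(m.group(1)), text)


print(convert_tags("Click <1>here<2> now", "angle", "curly"))
```

Keeping the patterns in one table means a new style could be supported by adding an entry rather than writing a new function, which is roughly the "collection of regexes" idea.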
@queerterpreter Even if it were, something that collects all the regexes for each system would be helpful!
But it's actually more complex. There are inline and offset annotation systems, for one thing!
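To make the inline/offset difference concrete, here is a rough sketch of converting one into the other. It assumes a made-up inline scheme with XML-like tags (`<PER>…</PER>`) and no nesting or attributes; real annotation formats differ in the details, and `inline_to_offsets` is a hypothetical helper, not a function from any library.

```python
import re

# Assumed inline scheme: <LABEL>text</LABEL>, no nesting, no attributes.
INLINE_TAG = re.compile(r"<(?P<label>\w+)>(?P<body>.*?)</(?P=label)>", re.DOTALL)


def inline_to_offsets(annotated):
    """Split inline-tagged text into plain text plus (start, end, label) spans.

    Offsets are character positions in the returned plain text.
    """
    plain_parts = []
    spans = []
    cursor = 0  # position in the annotated input
    length = 0  # length of the plain text built so far
    for match in INLINE_TAG.finditer(annotated):
        # Copy the untagged text between the previous tag and this one.
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        length += len(before)
        # Record where the tagged body lands in the plain text.
        body = match.group("body")
        spans.append((length, length + len(body), match.group("label")))
        plain_parts.append(body)
        length += len(body)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), spans


text, spans = inline_to_offsets("<PER>Ana</PER> vive en <LOC>Madrid</LOC>.")
for start, end, label in spans:
    print(label, start, end, text[start:end])
```

Going the other way (offsets back to inline tags) is where it gets harder, since offset systems can describe overlapping spans that a simple inline scheme like this one cannot express.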
@grvsmth Hmmm, this could be interesting to her! Do you know of a good starting point I could direct her to (it's HER final project, after all) for looking into the different annotation systems out there?
@queerterpreter ahahahaha! Yes, I know people have compiled a list of annotation systems. Several lists, in fact! I've probably bookmarked some of them on my other computer, and there are some in my email inbox. Maybe there's a list of lists of annotation systems. This Wikipedia article is probably a good place to start!
@queerterpreter I’ve found generally that it’s hard to incorporate corpora ad hoc into an existing linguistics project.
Whether you’re annotating data or just looking for n-gram counts, both are relatively labor-intensive and will shape the direction of your project pretty dramatically from the start.