lingo.lol is one of the many independent Mastodon servers you can use to participate in the fediverse.
A place for linguists, philologists, and other lovers of languages.

#textprocessing

Harald Sack<p>Building on the 90s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI. </p><p>F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA</p><p><a href="https://sigmoid.social/tags/NLP" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>NLP</span></a> <a href="https://sigmoid.social/tags/LanguageModels" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>LanguageModels</span></a> <a href="https://sigmoid.social/tags/HistoryOfAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>HistoryOfAI</span></a> <a href="https://sigmoid.social/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a> <a href="https://sigmoid.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://sigmoid.social/tags/historyofscience" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>historyofscience</span></a> <a href="https://sigmoid.social/tags/ISE2025" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ISE2025</span></a> <span class="h-card" translate="no"><a href="https://sigmoid.social/@fizise" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>fizise</span></a></span> <span class="h-card" translate="no"><a href="https://wisskomm.social/@fiz_karlsruhe" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>fiz_karlsruhe</span></a></span> <span class="h-card" translate="no"><a href="https://fedihum.org/@tabea" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>tabea</span></a></span> <span class="h-card" translate="no"><a href="https://sigmoid.social/@enorouzi" class="u-url mention" rel="nofollow noopener" 
target="_blank">@<span>enorouzi</span></a></span> <span class="h-card" translate="no"><a href="https://fedihum.org/@sourisnumerique" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>sourisnumerique</span></a></span></p>
Holle Meding<p>🔠 Panel: More than Chatbots: Multimodal Large Language Models in Humanities Workflows</p><p>At <a href="https://mastodon.social/tags/DHd2025" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DHd2025</span></a>, Nina Rastinger explores how well <a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> handles abbreviations &amp; NER:</p><p>✅ NER works well, even with small, low-cost models<br>❌ Abbreviations are tricky—costs &amp; resource demands skyrocket<br>🚀 GPT o1 improves performance, even on abbreviations, but remains resource-intensive<br>Balancing accuracy &amp; efficiency in text processing remains a challenge! ⚖️</p><p><a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> <a href="https://mastodon.social/tags/NER" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>NER</span></a> <a href="https://mastodon.social/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a> <a href="https://mastodon.social/tags/DigitalHumanities" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DigitalHumanities</span></a></p>
Pragmatic Bookshelf 📚<p>New at PragProg</p><p>Staffan Nöteberg helps you really understand how the machinery works under the hood. Learn advanced tools like reluctant quantifiers, lookbehind, and nondeterministic finite automata to write efficient and elegant regexes with ease. </p><p>In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of regular expressions.</p><p><a href="http://pragprog.com/titles/d-snrem" rel="nofollow noopener" translate="no" target="_blank"><span class="invisible">http://</span><span class="">pragprog.com/titles/d-snrem</span><span class="invisible"></span></a></p><p><span class="h-card" translate="no"><a href="https://mas.to/@staffannoteberg" class="u-url mention" rel="nofollow noopener" target="_blank">@<span>staffannoteberg</span></a></span></p><p><a href="https://techhub.social/tags/regularexpressions" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>regularexpressions</span></a> <a href="https://techhub.social/tags/patternmatching" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>patternmatching</span></a> <a href="https://techhub.social/tags/regex" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>regex</span></a> <a href="https://techhub.social/tags/regexp" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>regexp</span></a> <a href="https://techhub.social/tags/textprocessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>textprocessing</span></a></p>
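For readers unfamiliar with two of the features named in the post above, here is a minimal sketch using Python's standard `re` module. It is not taken from the book; the sample strings and variable names are made up for illustration.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy quantifier: ".*" grabs as much as possible, so the match runs
# from the first "<b>" to the last "</b>".
greedy = re.findall(r"<b>.*</b>", html)

# Reluctant (lazy) quantifier: ".*?" grabs as little as possible, so each
# bold element is matched on its own.
reluctant = re.findall(r"<b>.*?</b>", html)

# Lookbehind: "(?<=\$)" requires a dollar sign before the digits but does
# not include it in the match.
prices = re.findall(r"(?<=\$)\d+", "cost $40, not 15 or $7")

print(greedy)     # ['<b>one</b> and <b>two</b>']
print(reluctant)  # ['<b>one</b>', '<b>two</b>']
print(prices)     # ['40', '7']
```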
Veronica Olsen 🏳️‍🌈🇳🇴🌻<p>Back when I first wrote text processing code in the 90s on my Amiga 1200, I always used the ¤ symbol as a placeholder character for splitting and replacing to exclude things I wanted skipped without affecting character count. It was available on the Norwegian keyboard, and practically never used in text.</p><p>Recently I discovered that Unicode has two "Not a character" symbols perfect for the same usage: \uFFFE and \uFFFF.</p><p>They can be really useful!</p><p><a href="https://mastodon.online/tags/Code" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Code</span></a> <a href="https://mastodon.online/tags/Python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Python</span></a> <a href="https://mastodon.online/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a> <a href="https://mastodon.online/tags/Unicode" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Unicode</span></a></p>
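A minimal sketch of the placeholder idea from the post above, assuming the goal is to hide some spans from later processing while keeping every character offset unchanged. The helper name `mask_matches` and the sample text are hypothetical, not the author's code. Unicode reserves noncharacters such as U+FFFF for internal use, so the placeholder should be swapped back out before the text leaves the program.

```python
import re

PLACEHOLDER = "\uffff"  # Unicode noncharacter, not expected in real text


def mask_matches(text: str, pattern: str) -> str:
    """Replace every match with placeholders of the same length."""
    return re.sub(pattern, lambda m: PLACEHOLDER * len(m.group(0)), text)


text = 'say "a, b" then c, d'

# Hide the quoted part so the comma inside it is skipped.
masked = mask_matches(text, r'"[^"]*"')

# Offsets found in the masked text are still valid in the original text.
comma_positions = [i for i, ch in enumerate(masked) if ch == ","]

print(len(masked) == len(text))  # True
print(comma_positions)           # [17]
```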
Veronica Olsen 🏳️‍🌈🇳🇴🌻<p>2. Immediately after the split, replace U+FFFF with newline, but keep both versions of the line, and pass the one with the U+FFFF to the text paragraph parser. Everything else (like headings) gets the cleaned one.</p><p>3. After paragraph lines with a single break between them (belonging to the same paragraph) have been processed, THEN I replace the U+FFFF characters there.</p><p>It seems to work, but it took me like 3-4 hours to crack. 😅 </p><p>4/4</p><p><a href="https://mastodon.online/tags/Python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Python</span></a> <a href="https://mastodon.online/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a> <a href="https://mastodon.online/tags/Unicode" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Unicode</span></a></p>
Veronica Olsen 🏳️‍🌈🇳🇴🌻<p>I tried using the alternative line and paragraph separators from Unicode, but splitlines accepts them too. Then I discovered these Unicode characters:</p><p>U+FFFE &lt;noncharacter-FFFE&gt; not a character.<br>U+FFFF &lt;noncharacter-FFFF&gt; not a character.</p><p>The solution, then was:</p><p>1. Replace all occurrences of [br] with or without a trailing newline, using regex pattern "(?i)(?&lt;!\\)(\[br\]\n?)", with a U+FFFF character.</p><p>3/4</p><p><a href="https://mastodon.online/tags/Python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Python</span></a> <a href="https://mastodon.online/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a> <a href="https://mastodon.online/tags/Unicode" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Unicode</span></a></p>
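Step 1 of the post above can be sketched as follows, using the regex pattern quoted in the post. The names `HARD_BREAK`, `BR_PATTERN`, and `mark_hard_breaks` are assumptions for illustration. The last two lines show why the Unicode line separator does not work as a marker with `str.splitlines()` while U+FFFF does.

```python
import re

HARD_BREAK = "\uffff"  # noncharacter used as an internal marker

# Case-insensitive [br], optionally followed by a newline, and not
# preceded by a backslash (so "\[br]" stays literal).
BR_PATTERN = re.compile(r"(?i)(?<!\\)(\[br\]\n?)")


def mark_hard_breaks(text: str) -> str:
    """Replace each unescaped [br] shortcode with the marker."""
    return BR_PATTERN.sub(HARD_BREAK, text)


print(repr(mark_hard_breaks("a[br]\nb[BR]c")))  # 'a\uffffb\uffffc'
print(repr(mark_hard_breaks("a\\[br]b")))       # 'a\\[br]b'

# splitlines() treats U+2028 as a line boundary, but not U+FFFF.
print("a\u2028b".splitlines())                  # ['a', 'b']
print(len("a\uffffb".splitlines()))             # 1
```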
Veronica Olsen 🏳️‍🌈🇳🇴🌻<p>This works fine in principle, but it is incredibly hard to figure out exactly when to make the replacement.</p><p>For instance, if I do it too early, the parser will split on the breaks as I use splitlines() early on. If I do it too late, I get double line breaks some places.</p><p>2/4</p><p><a href="https://mastodon.online/tags/Python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Python</span></a> <a href="https://mastodon.online/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a> <a href="https://mastodon.online/tags/Unicode" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Unicode</span></a></p>
Veronica Olsen 🏳️‍🌈🇳🇴🌻<p>I've been struggling with solving an issue with my text editor project. The editor is plain text and uses a blank line to separate paragraphs.</p><p>The editor has an option to preserve or not preserve single line breaks inside paragraphs when generating the output.</p><p>However, some users want to not preserve them, but still want to be able to add hard breaks sometimes. So I've been trying out using [br] as a hard break shortcode.</p><p>1/4</p><p><a href="https://mastodon.online/tags/Python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Python</span></a> <a href="https://mastodon.online/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a> <a href="https://mastodon.online/tags/Unicode" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Unicode</span></a></p>
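The four-part thread above (shown newest first) can be sketched end to end roughly like this. It is a simplified reading of the described steps, not the author's editor code: the function name `render`, the rule that headings start with "#", and joining soft-wrapped lines with a space are all assumptions.

```python
import re

HARD_BREAK = "\uffff"
BR_PATTERN = re.compile(r"(?i)(?<!\\)(\[br\]\n?)")


def render(text: str) -> list[str]:
    """Return output blocks with soft breaks joined and [br] kept as newlines."""
    # Step 1: mark hard breaks before any line splitting happens.
    marked = BR_PATTERN.sub(HARD_BREAK, text)

    blocks: list[str] = []
    paragraph: list[str] = []

    def flush() -> None:
        if paragraph:
            # Soft (single) line breaks are not preserved.
            joined = " ".join(paragraph)
            # Step 3: only now turn the markers into real newlines.
            blocks.append(joined.replace(HARD_BREAK, "\n"))
            paragraph.clear()

    for line in marked.splitlines():
        if not line.strip():
            flush()  # a blank line ends the current paragraph
        elif line.startswith("#"):
            flush()
            # Step 2: non-paragraph lines get the cleaned version right away.
            blocks.append(line.replace(HARD_BREAK, "\n"))
        else:
            paragraph.append(line)  # paragraph lines keep the marker for now
    flush()
    return blocks


source = "# Title\n\nfirst line\nsecond[br]\nthird\n\nnext para"
result = render(source)
print(result)  # ['# Title', 'first line second\nthird', 'next para']
```

Deferring the marker replacement until after the paragraph lines are joined is what avoids both problems mentioned in the thread: the early `splitlines()` call never sees the hard breaks, and no doubled line breaks appear in the output.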
Veronica Olsen 🏳️‍🌈🇳🇴🌻<p>I started working on a Python class to write MS Office Word documents from already tokenized formatted text. It took me 5 hours to get a working version that can handle most of the formatting I need.</p><p>I have already done this with the Open Document format. It took me significantly longer, but I do steal some code from that code for DocX.</p><p>That said, DocX is actually easier to generate the XML for it turns out.</p><p><a href="https://mastodon.online/tags/Python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Python</span></a> <a href="https://mastodon.online/tags/Code" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Code</span></a> <a href="https://mastodon.online/tags/Documents" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Documents</span></a> <a href="https://mastodon.online/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a></p>
g z thompson<p>okay today's "put together presentation of ideas for tomorrow's meeting" has determined that i really need to learn a lot more about NLP</p><p>which i've been saying for months anyway but today it's like THIS IS REALLY NECESSARY NOW</p><p>So: what are essential NLP resources for a statistician/data scientist/ML practitioner? </p><p><a href="https://scholar.social/tags/NLP" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>NLP</span></a> <a href="https://scholar.social/tags/ML" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ML</span></a> <a href="https://scholar.social/tags/DataScience" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DataScience</span></a> <a href="https://scholar.social/tags/statistics" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>statistics</span></a> <a href="https://scholar.social/tags/python" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>python</span></a> <a href="https://scholar.social/tags/rstats" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>rstats</span></a> <a href="https://scholar.social/tags/chatbot" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>chatbot</span></a> <a href="https://scholar.social/tags/TextProcessing" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TextProcessing</span></a></p>