Language data is big business. The sub-industry, which supplies training corpora for language technologies ranging from natural language processing to machine translation, is enjoying a resurgence thanks to AI.
Virtually every AI-powered language technology is driving demand: speech recognition, sentiment analysis, question answering, summarization, and, of course, neural machine translation (NMT). Language data has always been necessary for technologies such as statistical MT, but NMT and other neural network-based solutions are even more data-hungry. What’s more, these technologies require high-quality, domain-specific language data to produce equally high-quality output.
The boom in language data has become so pronounced that companies like Appen had a “truly outstanding” 2017, breaking through a billion-dollar valuation. The Australia-headquartered, Australian Securities Exchange-listed company has two business lines: a Language Resources Division that provides datasets (audio, text, image and video) for training AI engines, and a Content Relevance Division that helps clients train AI-driven products (mainly search engines) via human evaluation and feedback. And the growth has not stopped for Appen: following its first-half 2018 results, its shares reached an all-time high.

