Shall We Become Transparent About Quality in Localization?

transparency has multiple facets

At the TAUS Quality Evaluation Summit in Dublin on June 8, 2016, a panel led by Antonio Tejada (Capita TI) discussed the topic of transparency in localization at large – and in multilingual content quality management in particular. The panelists who contributed to this discussion – and the co-authors of the below article – are Antonio Tejada himself, Anna Woodward Kennedy (Chillistore Technologies), Attila Görög (TAUS), Jeremy Clutton (Lingo24), and our own Kirill Soloviev (ContentQuo), who has also served as the article’s editor. 

The translation industry remains fragmented. Even today in our disintermediated era there are multiple participants in the same translation/localization workflow. Each participant (from the buyer through MLVs and SLVs down to the translator) has information that is not being shared up and down the stream. Yet, data on the impact, quality and productivity of translations is extremely useful to ensure efficiency and showcase the credibility of our industry.

Still, this type of information often remains a secret – or gets lost in the labyrinth of translation processes – because the mindset and the tools to unlock this data are lacking within companies. Why don’t we finally become transparent about the figures? And about our own quality and productivity and impact? An increasing number of buyers and their vendors, vendors and their translators have decided to collaborate on improving quality and productivity by offering each other full transparency, with mutual benefits.

At the QE Summit in Dublin, TAUS managed to gather several advocates of transparency from different parts of our industry for a comprehensive panel discussion of the topic: a localization industry veteran executive, an owner and operations director of a language quality services company, a global account director for a tech-savvy mid-sized LSP, and a co-founder of a technology startup focused on outcome-based localization quality management.

The panel was followed by a break-out session later in the day, where panelists were joined by like-minded audience members willing to discuss challenges with implementing transparent practices in their own translation & localization programs. Here’s a glimpse of the ideas that have been covered.

Why is transparency so important in localization?

Because localization is a complex system and it’s the only way to optimize it1

Transparency enables a holistic view of this system that’s required to achieve optimal performance. If we only focus on a few of localization subsystems (downstream processes) and/or super-systems (upstream processes) while ignoring the rest, we can all too easily fall into a trap called “local optimization”: improving the performance of a single component while degrading the overall performance of the whole system. To prevent that, transparent sharing of data from all system levels is essential: it’s the only thing that allows us to truly understand all parts of the system and their relationships.

Because entire localization supply chain can strongly benefit from it

Each stakeholder, from customer-side management/sales/marketing teams to individual freelance linguists, already has valuable information to share with the rest of their business colleagues – but they seldom do. This information, when properly aggregated and equally distributed, has the potential to inform strategic decisions both for LSPs (because they will better understand what’s actually happening with their customers and with their suppliers) and for enterprises (because they can use it to influence budget allocation for their globalization efforts).

Because delivering real value to translation customers requires it

Transparency clears the confusion between translation buyers and translation vendors during the purchasing process, especially in a scenario with multiple competing offers around price, service, and quality levels. It also focuses the discussion on what the customer is hoping to achieve with translation (i.e. the value it will bring to her business) and build a tailored service offering (as opposed to only considering internal industry concepts such as process details – these might not make much sense, especially for less mature buyers).

What are some examples of data that we want to share transparently?

Sharing data about quality and productivity of localization projects already seems like a given, and is within easy reach of most companies nowadays: TAUS Quality Dashboard and other complementary tools already provide an easy and affordable way to do that. However, there are also many more subtle – and sometimes more important – data points that are not yet being explored enough.

For example, translation buyers’ global business goals. Customers seldom buy translations in a vacuum. Instead, they buy translations because they want to achieve a specific goal – be it sales, growth, employee engagement, risk management, or compliance. Each vertical and company will have a different set of drivers, and being able to have an understanding of the goal can help the language service provider match up with the outcome and deliver what’s right for the client.

Then there is the actual impact of translations on company’s global expansion metrics. In the digital world, it’s relatively easy to access and compare business outcome measurements that are driven by translated content (such as conversion, engagement, and even sales – e.g. in eCommerce) – for any language and country where a product, service, or piece of content is made available. Compare localization services to dentist’s services: if a dentist’s work has helped you cure your toothache, would you ever consider hiding this fact from her? If not, then why are we still not sharing the numbers confirming successful (and occasionally failed) outcomes of localization with our translation supply chains?

Being open about translators allows the supply chain to be efficient (for example, by significantly speeding up the linguist-sourcing process for rare language/subject matter combinations). Localization vendors that still cling to carefully chosen translators as their most prized possession (and a key to a particular client) will likely find themselves outsmarted by vendors who know how to add value to customers with other means, as well as by dis-intermediating solutions like translator marketplaces soon.

Too often, quality assurance efforts focus on showcasing deficiencies in translation. Negative feedback frequently passes through supply chains in a rather transparent fashion, unhindered. However, offering praise on a translation that was well done can work wonders for translators’ morale and motivation. So it makes perfect business sense to deliver it more often. TAUS DQF even offers a dedicated issue category for “kudos” – use it to compliment your translators today!

What barriers exist to transparency?

It’s the year 2016 now, with Internet of Things and Big Data marching loudly across the increasingly digital and interconnected planet – and localization has to keep up. Yet, in many ways, our industry still follows a mindset that dates back to the 1980s – when printing multilingual user manuals for a microwave was a still thing, and when storing, analyzing, and sharing even moderate volumes of data was practically impossible.

Freelance linguists have been so subdued for decades that it takes tremendous effort to simply have them respond to the rest of the virtual project team and start taking part in discussions as basic as saying “Hello!”

Silos inside organizations are unfortunately typical, especially in larger companies, and hinder the most basic kind of transparency: internal transparency (which is supposed to be easier than sharing across organizational boundaries!). Arduous evangelization is required to pull all these people out of their shells and get them to collaborate and share, as everyone seems to have some little secrets they are not keen on parting with.

After you invest significant efforts to achieve transparent relationships with an important stakeholder, they sometimes disappear (e.g. leave the company, or a new vendor joins in their place) – and you have to start the evangelizing work from scratch. If a new stakeholder has a different level of visibility into privileged corporate information (e.g. a contractor), you might not even be in a position to share the required data anymore.

What’s in it for me, the localization professional?

Transparency allows buyer-side localization managers to become champions for global business expansion

Talking about localization impact on business outcomes is key to making that happen, but it requires transparent access to relevant data and infrastructure. How do I make localization not just efficient but effective, and maximize value-for-money for my company or my customer? That’s the real question that today’s localization leaders have to answer convincingly if they seek to remain relevant in the convergence era heralded by TAUS.

Transparency empowers individual linguists and enriches their jobs

By becoming an openly acknowledged and transparently accessible element of large virtual project teams, they get a chance at working much deeper with larger numbers of important stakeholders on more complex and fulfilling tasks (as opposed to simply being an anonymous line item in some translation management system).

Transparency enables LSPs to build longer-lasting and more profitable business relationships with their customers

After all, what agency doesn’t dream of being able to clearly show how much its translations have contributed to its customers’ top line?

Are there situations where transparency might actually hurt?

In rare individual occasions, being transparent has a risk of negatively impacting the global team’s morale, but only if not handled carefully enough. Examples of this are:

  • Employee performance assessments
  • Unclean, irrelevant, inconclusive, or incomparable data
  • Negative ROI (e.g. low sales in a country with continued localization investment)
  • Confusing being transparent with delegating authority for decisions (e.g. expecting a stakeholder to choose a localization process for you, as opposed to simply informing him of the choice made)

However, the benefits of transparency described above are infinitely more valuable, and making a manager’s job slightly more complex is never a good excuse for not harnessing transparency’s full power – as our panelists and break-out session members have all agreed upon.

We do hope that our own experience summarized in this article was useful – and encourage you to take a step towards making your own localization programs and relationships more transparent starting right now, today!

Original article published here: ContentQuo

Where Does Language Fit in with Big Data?


Where Does Language Fit in with Big Data?

For the diverse universe of digital content generated by big data to be useful, it requires transformation for different channels (such as web, mobile, and print), conversion for various applications, and localization for other markets. This is an area of opportunity for translators and interpreters.

Go to any conference and you’ll find a few new additions to the usual buzzword bingo of industry jargon—“big data” and numbers with lots of zeroes. You’ll hear about the massive growth in digitized data, how often a given sector’s knowledge base doubles, and what companies are doing to manage and interpret that flood of data. This burgeoning trove of bytes includes structured databases, application code, images, videos, and text. You’ll also hear about machine learning and how big data contributes to making software more responsive and useful to customers’ needs.

Just how much data are we talking about? Already huge, the digital universe of content, code, and structured data grows by a mind-blowing amount every 24 hours. Each day the world creates another 2.5 quintillion bytes of data.1 This data comes from many sources, including documents, social media posts, electronic purchase transaction records, and cellphone GPS signals. That daily infusion is estimated to pump the global repository of information from the 7.9 zettabytes (7.9 x 1021 bytes) available in 2015 to 176 zettabytes by 2025.2 Keep in mind that 1 zettabyte equals 1,000,000,000,000,000,000,000 bytes—an incomprehensible number.3 And that total doesn’t include the inestimable amount of content that is spoken every day.

Whatever the content being created, this truly immense volume includes massive, unrealized potential for translation or localization. So, what does this mean for the language industry, both humans and machines?

What Is Big Data and Why Does It Matter?

When we talk about big data, we refer to new ways of taking large amounts of data and using software tools to identify previously undiscovered patterns, trends, correlations, and associations. If you’ve ever bought a book because an online retailer told you that customers with viewing histories like yours have enjoyed it, you’ve been the beneficiary of big-data analytics.

This practice became possible because of the digitization of business, government, and everyday life over the past few decades. This information is stored in massive databases of structured data and repositories of documents large and small. We feed this growing beast with more bits and bytes every day. While all organizations rely on data to run their operations, a small but growing number use it to better understand behaviors, preferences, and trends in their world. Then, using those insights, organizations can make better decisions about how they market their wares, help their customers, improve operational efficiency, or build the next great thing.4

How do they do it? It’s not easy given the diversity of structured data and text. For highly structured data, software specialized to deal with big data draws from very large databases, often distributed around a network. Then, analysts employ a new generation of business intelligence and textual analysis tools to turn this raw data into usable information and actionable insights.5 They may combine transaction data with server logs, clickstream data, social media content, and customer e-mail texts, sensor data, and phone records to extract insights. They also extract insights using advanced analytical tools, including statistical analysis, data and content mining, predictive analysis, and text analytics. Traditional business intelligence and modern data visualization software help analysts present their findings in human-readable formats.

The language industry was actually one of the first areas of interest for big data applications. One of the early mainstream applications was in the statistical machine translation (SMT) efforts of Google and Microsoft. A 2011 Common Sense Advisory (CSA) report on MT trends characterized these statistics-based approaches to MT as big-data applications because they leverage large repositories of bilingual content. For example, they compare source documents in English to their human-translated Russian variants.6

In simplistic terms, SMT translates by comparing the zeroes and ones of the source file with the translation to find correlations and patterns. In other words, massive processing power allows computers to disassemble texts and their translations, analyze the patterns, and predict translations for texts they have never seen before. Such analytics has increased the speed of language support over earlier MT solutions that relied on teams of linguists to create grammars, code them as rules, create dictionaries of bilingual translations, and then constantly modify or add to the rules as they found exceptions.

The 2011 CSA report predicted that experts would apply these mathematics-based big-data algorithms to crack inter-language communication and marketing issues as they processed more languages and a huge volume of multilingual content. And that, in fact, is what has happened.

Over the past several years, MT based on big-data analysis has drawn far more usage than the first-generation rule-based solutions. Google Translate draws massive numbers of users, which is a testament to its easy access and perceived, if not actual, improvement of the quality of MT output. Although academic research shows improvements using popular quality assessment systems such as BLEU7(bilingual evaluation understudy), these changes are not cumulative and results vary widely between languages and translatable content types (e.g., regular text, audio, video, and social media). Thus, data on quality improvement is anecdotal and may be balanced by lowered user expectations for quality.

The availability of cloud-based computing with unlimited horsepower from the likes of Amazon Web Services and Microsoft Azure supports these big-data practices. This kind of harvest and analysis will continue to grow into the “Internet of Things” as many billions of devices come online (e.g., sensors, embedded controllers, wearables, health checkers, and widgets not yet invented).

To be useful, much of this content requires transformation for different channels (such as web, mobile, and print), conversion for various applications, and localization for other markets.8 Corporate and government planners already know it’s not enough to have all that digitized information available in just a single language. Their mission is to use as much data as possible to support customer experiences for the populations that really matter to them. Otherwise, it will be impossible to engage and retain international or domestic multicultural audiences.

Just consider the requirements necessary to translate that information into other languages to make it available to a broader audience. It’s estimated that it takes 14 languages to reach 90% of the world’s most economically active populations, but most websites max out with support for just six languages or locales.9 Product and document localization at many companies lags even further behind. Spoken-language interpreting is even more limited.

As the volume of data organizations produce grows, so too will ambitions to reach a greater audience for goods and services. Client-side respondents to a recent CSA survey reported that they plan to increase translation volume by 67% over the next three years, from an average of 590 to 990 million words per year.10 This increase is one that the language industry cannot meet with current methods, and that buyers in the CSA survey sample expect to address with a combination of post-edited content from their suppliers and raw MT.

Where Big Data Fits Today—and in the Future

Organizations are starting to realize that their plans for more translation could very well exhaust the capacity of all current translators, as well as those who will enter the field in the foreseeable future.11

To help keep up with the demand, many organizations are employing both productivity enhancements for human translators and MT to overcome the challenges associated with volume, turnaround time, the need to deal with more target languages, and flat budgets. Companies invest in human translation and post-edited MT for essential business content, such as product and marketing materials that are reasonably stable. For example, translation buyers rely on a large and growing cohort of providers that employ MT to pre-process the source material and then edit the output with human linguists. A small percentage of client-side organizations also use unedited MT output for business content, such as FAQs and knowledge bases.

Besides translating a limited set of business-oriented text, some buyers have increased the use of MT to process user-generated content, such as product evaluations, hotel reviews, and forum discussions that few organizations have bothered to translate in the past. But as research conducted by CSA indicates, online consumers and business buyers alike would prefer to have user reviews translated, even if these reviews are all that gets translated.12

Figure 1: Hypothetical Correlation of Translation to Daily Content Creation (note: "daily spend" = daily spending on language services) Source: Common Sense Advisory, Inc.

Why the Volume of Big Data Concerns Translation Buyers and Suppliers

Big data represents enormous numbers, and it turns out that one day in the translation industry barely puts a dent in its volume. Let’s focus on just the written word and how it relates to that 2.5 quintillion bytes of data being generated every day.

Despite today’s objective of making humans more productive to save time and money, the world is far from the nirvana of having enough online content available in all languages. From years of research and consulting, we know that any discussion about whether or not to invest in translation, localization, and interpreting has to begin with a review of available data.

CSA decided to investigate the enormous challenges facing the localization industry in terms of translating what should be translated from the totality of all data that could be translated. We decided to start with a given day’s output of digital content and determine what could actually be translated if we had the entire language industry working on just that content and none of the backlog of existing data.

What is this data? It’s everything that is digitally created every day, from documents to SQL data and telemetry to digital multimedia. For this hypothetical exercise, we began with the expenditure for outsourced services. It’s estimated that translation in various forms—human, post-edited, transcreation, plus website globalization and text-centric localization—accounts for US$26.4 billion of the $38.1 billion market for language services and technology.13

We then calculated the daily amount spent by the word. We divided US$26.4 billion by 365 days and estimated that the translation sector is worth US$72 million per day. At a hypothetical rate of 20 cents per word, we estimated that professional translators would process nearly 362 million words every day. We then converted that to bytes at the rate of 9.71 characters per word, which equates to seven billion bytes of double-byte characters. (Note that some languages have fewer characters per word on average and others have more).14

Finally, we compared it with the daily volume of content creation. When we divide the 2.5 quintillion bytes by the amount of target-language content produced by language services providers, we estimate that translation firms could potentially process just 0.00000000009% of the content created every day. However, we can safely assume that much of that data will never be translated—either the material isn’t translatable or translating it doesn’t make sense.

But some of what isn’t translated today (e.g., user reviews and social media posts) is on the future agenda of enterprise translation buyers as they strive to improve the customer experience. Even if we exclude all but an infinitesimal percentage of those daily bytes, the amount of content outsourced for translation is far less than 1% of what’s created every day. And remember that we are talking about the shortfall in translation for just one day. That number doesn’t address the backlog of content not yet translated.

As the results from this hypothetical exercise indicate, if the content is translated at all, it’s typically into just six languages online (and often fewer elsewhere). This is far short of the total number of online languages that really matter for both global and domestic communication and commerce.

Of course, there are many other variables and mitigating factors that affect these calculations. For example, consider in-house translation, languages for which you should translate but don’t, and the many zettabytes of existing content. The bottom line is that there’s an enormous amount of content that will never be translated or localized. That means opportunity for the language sector, and not just the technology companies.

What Big Data Means to the Language Sector

The big data and translation needs we discussed represent an opportunity for the language sector, but many translators look at the situation and worry that widespread deployment of MT will take work away from them. Our research estimates that translators will, in fact, lose some lower value jobs to MT, but that the total amount of work they have will increase at a steady rate for the foreseeable future.

If we also consider the expansion in post-editing—a contentious topic to be sure—we see that reliance on human professionals will grow faster than the current pipeline of future translators can add capacity. As a result, translators and interpreters will require productivity benefits from big data if they are to keep pace with demand. A few will take a much bigger step and become specialists who can build, train, and improve MT engines.

On the productivity front, we see that big data today trains statistics-based MT engines and could be used to supplement the post-editing processes of other MT models. Connections to MT are available in CAT tools such as Kilgray memoQ, Memsource Cloud, and SDL Trados Studio. Meanwhile, startups like Lilt use MT output in a CAT-like tool to accelerate human translation. We have also been briefed by software developers who are evaluating big-data machine learning techniques to improve terminology, translation memory, disambiguation, and a variety of other content creation, localization, and reviewing tasks. In short, big data will underpin most of the software tools translators use. Interpreters will also benefit as MT technology evolves for spoken languages.

What does big data mean for professional linguists? Just as they saw with translation memory and terminology management, linguists will have another tool at their disposal. Employers on both the end-buyer and agency sides will expect them to use this software to speed up their work and improve the usefulness of the output because of improved analysis of the source content.

Our 2016 survey of language services providers found that 49% of respondents have already committed to post-editing MT as a service.15 As early as 2012, our research showed that 21% of freelancers had experience using the technology.16

Some will move away from the classic translation agency structure to become big-data specialists. They will create clusters of industry- and domain-specific memories and harvest, analyze, and translate content. Content curation positions in which language professionals work with data applications to integrate relevant results to “enrich” them with useful metadata (e.g., topic categorization, classification of names and entities) are just now emerging.17 These positions will allow localizers to add market-specific value to content. Some will take the next step into the global marketing mainstream, adding to their portfolio services such as transnational business intelligence to help companies better understand their markets, or cross-language semantic and sentiment analysis to cull the opinions of consumers and business buyers out of multilingual content.

Big data has increased the volume of content dramatically. At the same time, automated content enrichment and analytical tools based on big-data science will enable the training of more sophisticated tools to help humans translate the growing volume of content and enable machines to close the yawning gap between what’s generated and what’s actually translated. No doubt some linguists will view these big-data-based innovations as threats. Others will view such advances as opportunities that will help them enhance the meaning of the source content, increase the usefulness of the other tools they employ, and increase their productivity in the process.

Although it has not happened yet, we speculate that MT driven by these phenomena could remove the “cloak of invisibility” from translators, giving them greater recognition and status.18 Even if machines generated the lion’s share of translation and humans did a smaller percentage, the sheer absolute volume of human translation would increase for high-value sectors such as life sciences, other precise sectors, and belles lettres. In turn, the perceived value of human translation could increase. Why? Because when you bring in a live human, it means the transaction is very, very important. It’s not so different from accounting. Software can handle routine tasks, but when problems arise or something is critical, you bring in a high-paid accountant to deal with it.

As interlingual communication becomes transparent, we predict that the number of situations where high-value transactions occur—i.e., those requiring human translators and interpreters—will go up, not down. If provider rates increase and companies use MT to address a larger percentage of their linguistic needs, human translators could benefit as they’re paid well to render the most critical content supporting the customer experience and other high-value interactions.

  1. “Bringing Big Data to Enterprise” (IBM),
  2. Legendre, Chelsea. “Year 2025—An Age of Machine Learning and Data On-Demand” (U.S. Department of Defense, April 27, 2016),
  3. Foley, John, “Extreme Big Data: Beyond Zettabytes and Yottabytes,” Forbes (October 9, 2013),
  4. Allen, Lisa. “What Is Big Data?” Forbes (August 15, 2013),
  5. Buluswar, Murli. “How Companies Are Using Big Data and Analytics” (McKinsey&Company, April 2016),
  6. “Trends in Machine Translation” (Common Sense Advisory Research, October 2011), 5.
  7. Papineni, Kishore, et al. “BLEU: A Method for Automatic Evaluation of Machine Translation,” in the Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (July 2002), 311–318. BLEU, or bilingual evaluation understudy, is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another.
  8. “Content Strategy for the Global Enterprise” (Common Sense Advisory Research, April 2011), 11–14.
  9. “Global Website Assessment Index 2015” (Common Sense Advisory Research, July 2015), 2–8.
  10. “MT’s Journey to the Enterprise” (Common Sense Advisory Research, May 2016).
  11. “Translation Future Shock” (Common Sense Advisory Research, April 2012), 16–18.
  12. “Can’t Read, Won’t Buy” (Common Sense Advisory Research, February 2011), 46–47.
  13. “The Language Services Market: 2015” (Common Sense Advisory Research, June 2015), 2–5.
  14. This is a conservative estimate. The Unicode Consortium’s UTF-8 character encoding representation, which accounts for 87% of all non-binary data on the Internet, requires one to four bytes per character. However, European languages Roman script uses mostly one-byte characters. For more details, see pages
    12-14 of “Translation and Localization Pricing” (Common Sense Advisory Research, July 2010) and
  15. “Post-Editing Goes Mainstream” (Common Sense Advisory Research, June 2012), 6.
  16. “Translation Future Shock” (Common Sense Advisory Research, April 2012), 12.
  17. FREME Open Framework of E-services for Multilingual and Semantic Enrichment of Digital Content,
  18. “How Google Translate Will Increase Demand for Human Translation” (Common Sense Advisory Research, March 2010).

    Original article published here: The Ata Chronicle