Bar Ilan University

Translation Studies Department


Ilan Bloch













Sentence splitting as an expression of translationese


Seminar paper








Black Box Seminar

Dr Miriam Shlesinger








Abstract

Over the past decade, as corpus-based research has developed, translation universals have become a fashionable area of interest in translation studies. These universals take a wide variety of expressions, and sentence splitting is one means of expression shared by several of them. This empirical study aims to give insight into the use of sentence splitting as a means of implementing such strategies in the translation of non-literary texts. To this end, a parallel translational corpus of technical texts was compiled and occurrences of sentence splitting were sought and identified. The results clearly show the existence of split sentences in the translations, as well as a varying intensity of the pattern depending on the target language. It is hoped that further work, based on larger, fully qualified corpora, will confirm the results of this pilot-sized study.




Keywords: translation universal, sentence splitting, CAT tool, translation memory, corpus



Contents

Abstract
Introduction
Translation universals – definitions and background
What are translation universals?
Examples of translation universals
1. Normalization
2. Simplification
3. Explicitation
CAT, TM, Translation Units
Research context
Methodology
Results and discussion
Conclusion
Annex A
Annex B
Bibliography




Introduction

Specific patterns have been identified in translated texts that seem to characterize translation in a unique manner. They are referred to as translation universals. In particular, simplification, explicitation and normalization are among the most intensively studied universals.

Universals may take various forms and expressions, from character name adaptation to the use of superordinates instead of the hyponyms found in the source text. Sentence splitting may answer the needs formulated by each of these three universals.

This empirical study aims to give insight into the use of sentence splitting as a means of implementing such features in the translation of non-literary texts. To do so, I have compiled a parallel translational corpus of technical texts and their translations from English into four European languages. One important specificity of this corpus is that CAT tools, with pre-defined segmentation rules, were used during the translation.


Translation universals – definitions and background


What are translation universals?

Well before the generalized use of corpora in translation studies, researchers observed patterns in translational material. They formed hypotheses according to which all translations by nature call for some or many of these patterns to be expressed. As a first approximation, this is a gross, yet not incorrect, approach to the notion of translation universals.

In the late fifties, Nida (1959) proposed the semiotic law of loss, stating that translation necessarily involves some degree of loss of information. Translators may then decide to take specific actions in order to offer a solution; in other cases, no action is taken, either because they are unaware of the problem or because they have no intention of dealing with the inevitable; finally, action may also be taken unconsciously.

This actively studied area has given birth to a rich set of naming conventions. In particular, one notes the undifferentiated use of “translation universal” and “translation strategy”. The latter may be confusing since, in this sense, such a strategy is not necessarily a conscious process, as will be developed hereafter. Other researchers, such as Toury (1995:268), even prefer to speak of “laws” rather than universals. “Shifts” is also to be found in prominent works on the subject.

More generally, Laviosa (2002:43) defines translation universals as “linguistic features which typically occur in translated texts and are thought to be the almost inevitable by-products of the process of mediation between two languages rather than being the result of the interference of one language with another”. This notion of universal is related to Frawley's (1984) “third code”, the unique language at the meeting point of the source and target texts, languages and cultures. It is in any case a descriptive toolset, identified through the observation and comparison of source and target texts on the one hand, and the comparison of original and translated materials on the other.

Echoing Frawley, Gellerstam (1986) used the word "translationese" to describe this specific dialect that is translated language. In particular, the Swedish author associates it with the "fingerprints" left by the source in the target language after translation. In his 1986 research, such fingerprints were identified in English-Swedish translated novels, especially at the syntactic level. An important characteristic of translation universals is that they are independent of the languages involved, although their expression is necessarily related to the source and target languages.

For Toury (1995:208), the translation of lexical items is “partly governed by a felt need to retain aspects of the corresponding source invariants.” This causes alien forms to appear in the target language, be it structures that are seldom or even never encountered in the target language, or lexical items which belong to the general lexicon but used in a somewhat different function, “semantic, stylistic and/or pragmatic”.

Further to Even-Zohar's (1990) polysystem theory, Toury (1995:210) also notes that the point in time at which translation-specific lexemes figure in both translational and non-translational texts may differ, and may bear on the conservatory, or alternatively innovatory, force of the usage.

Blum-Kulka (1986) has also developed the notion of translation universals in her investigation of shifts of cohesion and coherence.

By definition, coherence is a “covert potential meaning relationship among parts of a text, made overt by the reader or listener through processes of interpretation” (Blum-Kulka, 1986:17), while cohesion is “an overt relationship holding between parts of the text, expressed by language specific markers” (ibid.).

Shifts of coherence are particularly hard to measure quantitatively since they address “the realization of the text’s meaning potential”. Blum-Kulka (1986:23) studied reader-focused (linked to the change of audience) and text-focused (linked to the translation act itself) shifts of coherence. Going beyond Whorf’s proposition of a mutual influence between language and culture, Toury associates reader-focused shifts of coherence with the norms of the target system (Blum-Kulka, 1986:24). She concludes by noting that reader-based shifts of coherence are unavoidable, “unless the translator is normatively free to “transplant” the text from one cultural environment to another” (Blum-Kulka, 1986:29).

Text-focused shifts of coherence result directly from the particular choices of a specific translator. This definition would certainly recall Toury's own distinction between obligatory shifts (dictated by the language) and non-obligatory shifts (made in response to other factors), were it not that Blum-Kulka (1986:30) suggests that text-focused shifts of coherence actually cast doubt on the translator's ability to comprehend and transfer the source text into the target language: “I would like to suggest that the most serious shifts occur not due to the differences as such, but because the translator failed to realize the functions a particular linguistic system, or a particular form plays in conveying indirect meanings in a given text”.

Following these earlier research works, a more recent trend has developed around the very existence of translationese, that is, the identification of fixed lexical, syntactic and/or textual features in translated material (Puurtinen 2003). This follows a slightly different pattern: while the traditional method is based on comparing original texts with their translations, i.e. is translation-corpus based, this newer trend uses monolingual comparable corpora, studying original texts alongside translated texts in the same language.


Overall, the use of corpora in translation studies, whether in a translation universals context or not, was largely pioneered by Baker in the early 90s (Baker 1993, 1995). The potential translation universals she identified were later studied further; some were confirmed and, in some cases, the hypothesis was rejected (Granger 2005). Corpora, whether parallel or comparable, are now an obligatory point of passage in contrastive linguistics and translation studies.



Examples of translation universals


Predictably, the classifications of translation universals are as numerous as the researchers who establish them. Andrew Chesterman, in his Translation Theory lectures in the MonAKO program of Multilingual Communication at the University of Helsinki, provides a three-tier structure organized into syntactic, semantic and pragmatic strategies. Each category is subdivided into an impressive number of universals, which also account for much earlier classifications by researchers such as Nida and Vinay & Darbelnet.

Out of the multiple reported strategies, and based on Laviosa's own review (2002:43-70), I will develop three universals in particular:








1. Normalization


The normalization strategy is manifested by an effort to meet normative criteria of the target language. Scott (1996:112) defined normalization as “the translator's sometimes conscious, sometimes unconscious rendering of idiosyncratic text features in such a way as to make them conform to the typical textual characteristics of the target language”. Vanderauwera's (1985) earlier findings, as reported by Laviosa (2002:55,68), illustrate Baker's (1996:183) view of normalization as a tendency to “exaggerate the features of the target language and to conform to its typical patterns”. In fact, Vanderauwera's studies of translated Dutch novels revealed greater conventionality, with respect to the target norms, in the form of the adaptation of Dutch names and references, the reduction of foreign-language expressions and the use of more standard punctuation.

May (1997) mentions similar normalized punctuation and syntax in her research on Russian translation of Virginia Woolf's fiction.

Interestingly, these stylistic features, which are an intrinsic component of Woolf's streams of consciousness, were sanitized through the translation process, hence possibly betraying one of the purposes of the source text. But didn't the proverb warn: traduttore, traditore?

Numerous examples of research on normalization are readily available in all kinds of texts and languages. In his “law of growing standardization” Toury (1995:268) states that “in translation, source-text textemes tend to be converted into target-language (or target-culture) repertoremes”. Moreover, he discerns two poles of normalization which extend from systemic constraints of the target language (obligatory shifts) to normalization resulting from the translator's own preferences (non-obligatory shifts).


These earlier, rather manual findings were confirmed by Scott's 1996 computer-aided research on the translation of the Portuguese negation não throughout the English translation of Lispector's novel “A Hora da Estrela”. Further conclusions were reached with regard to the overall clarity and readability of this original work. Most importantly, the real advance in the work of Scott and other corpus pioneers is the development of a new, replicable and very promising computer-assisted methodology.


Furthermore, it is interesting to note that normalization is perceived by readers as non-translational. As Blum-Kulka and Weizman's experiment (1987) showed, shifts from target-culture norms, whether linguistic, stylistic, textual or even pragmatic, are widely associated with translated texts. Yet Toury (1995:230) expressed his reservations with regard to this claim. More recently, Tirkkonen-Condit (2002) conducted a similar experiment requiring subjects to identify originals and translations in a collection of Finnish texts. She came to the conclusion that the frequency, or conversely the scarcity, of unique items oriented the subjects’ appreciation.


Normalization, as a translation strategy, has many faces. It depends on a very wide and subjective set of parameters, among them time and place. One particular manifestation is the rearrangement of punctuation, for instance sentence splitting.


2. Simplification


One of the earliest empirical studies to bring forward simplification as a strategy is Blum-Kulka and Levenston's (1986) analysis of Hebrew and English translations. Reported in detail by Laviosa (2002:43), the study is nevertheless much criticized and in many respects rejected by her. However, most of the elements studied by Blum-Kulka and Levenston are reported in other studies, although analyzed differently. For instance, Vanderauwera (1985:102-103), in her study of Dutch novels, also notes the use of modern, colloquial and simple synonyms in lieu of old, formal and affected source words.

In line with Blum-Kulka and Levenston, she noted the use of simpler syntax and other stylistic simplifications, particularly: breaking up of long sentences and shortening of overlong circumlocutions. However, Vanderauwera analyzed these as a process of normalization rather than simplification.


Blum-Kulka and Levenston explained the use of target superordinates to translate source language hyponyms as a response to a semantic void. Baker (1992) came to a similar conclusion and added paraphrase as another available strategy when no lexicalization of a source term exists, although it does not have the same status as “a stable word” (Laviosa, 2002:48).


Specifically, the reduction, and oftentimes omission, of the repetitions and redundancies found in the source text has been widely observed both in literary translation (Vanderauwera, 1985; Toury, 1991) and in courtroom interpreting (Shlesinger, 1995). This process is likely to be seen as a stylistic simplification. Toury (1991:188) concurs that this is “one of the most persistent, unbending norms in translation in all languages studied so far”.


Later corpus-based, computer-powered studies on the simplification phenomenon yielded very interesting results as well.

Al Shahab (cited in Laviosa 2003) investigated vocabulary variety in Arabic-English translations versus original English texts. His hypothesis of a lower type-token ratio in translated texts was tested on transcripts of radio broadcasts. He compared different English radio news broadcasts: original English from BBC Radio Four (native-speaker audience), original English from the BBC World Service (non-native-speaker audience) and the Damascus English Service (translated from Arabic). The results showed that the translational texts had a much lower type-token ratio than the original English texts. In other words, translated texts contain fewer distinct words (types) relative to the total word count (tokens) than original English texts.
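The type-token ratio described above is straightforward to approximate in code. The sketch below merely illustrates the measure itself; the naive regex tokenizer and the sample sentences are my own illustration, not the tooling or data Al Shahab actually used:

```python
import re

def type_token_ratio(text: str) -> float:
    """Ratio of distinct word forms (types) to total word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# A repetitive sample scores lower than a lexically varied one.
varied = "The quick brown fox jumps over a lazy dog near the river bank"
repetitive = "the plan was good and the plan was simple and the plan was cheap"
```

Note that raw type-token ratios are sensitive to text length, which is why corpus studies usually compare texts of comparable size or use standardized variants of the measure.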

In her own research, Laviosa (1998, 2002:60) created the English Comparable Corpus as the main source for her descriptive work on patterns of lexical and stylistic simplification. Essentially three hypotheses were tested:

a. Lexical variety

“In a multi-source-language comparable corpus of English the range of vocabulary used in the translational texts is narrower than the range of vocabulary in the non-translational texts and this difference is independent of the source language variable.” (2002:60)


The use of less varied lexical elements in translational material, as opposed to non-translational texts, is a form of lexical simplification. Its universal character comes from the independence from the source language. This is another formulation of the hypothesis Al Shahab made and verified.


b. Information load

“In a multi-source-language comparable corpus of English the translational texts have a lower ratio of lexical to running words than the non-translational texts and this difference is independent of the source language variable.” (2002:61)


Information load may be measured by the ratio of lexical items to running words. A simpler text contains fewer lexical words, so the ratio is expected to be lower in translational, simplified texts.




c. Sentence length

“In a multi-source-language comparable corpus of English the translational texts have a lower average sentence length than the non-translational texts and this difference is independent of the source language variable.” (2002:62)


Short sentences are more readable. A simplification strategy may reasonably be assumed to influence the readability of the text by shortening sentence length.
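The other two measures (information load and average sentence length) can likewise be approximated with simple counts. In this sketch, the small function-word list is a crude stand-in for the part-of-speech tagging a real study would use, and all names are illustrative:

```python
import re

# Crude stand-in for part-of-speech tagging: any token not in this
# small function-word list counts as a lexical (content) word.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in",
                  "on", "at", "is", "are", "was", "were", "it", "that", "this"}

def lexical_density(text: str) -> float:
    """Ratio of lexical words to running words (hypothesis b)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(lexical) / len(tokens) if tokens else 0.0

def mean_sentence_length(text: str) -> float:
    """Average number of words per sentence (hypothesis c)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    return len(words) / len(sentences) if sentences else 0.0
```

Under the simplification hypothesis, both figures would be lower for translational texts than for comparable originals.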


This experiment confirmed the first two hypotheses, while the third was verified only for newspaper articles and not for narrative prose.


Whether sentence splitting is categorized as normalization or as simplification is rather a rhetorical issue, and I tend to agree with Blum-Kulka that in some way the effect obtained is that of a simpler, more readable target text. The universal quality of the strategy of simplification by way of sentence splitting should not be compromised by Laviosa's observation: a universal may be strongly expressed in some sub-genres and hardly at all in others without this contradicting its universality.


3. Explicitation


The first use of the notion of explicitation is found in Vinay & Darbelnet (1958), who set the since generally accepted definition: the process of introducing into the target language information which is present only implicitly in the source language, but which can be derived from the context or the situation.

However, explicitation was first claimed as a translation universal only much later, in Blum-Kulka's (1986) study of professional and non-professional translation from English into French.

Whether the observed rise in the level of explicitness is due to a different normative use of cohesion markers in the source and target languages, or to a more redundant target text after complex processing, it may be viewed as inherent to the process of translation; this is precisely the object of the explicitation hypothesis Blum-Kulka formulated (ibid.).

At the time of her 1986 publication, large-scale empirical data was still missing to confirm such hypotheses. More recent studies, however, have supported the hypothesis. Séguinot (1988) observed greater explicitness in translational material, French and English, in the form of added linking words and the conversion of subordinate clauses into coordinate ones. However, she attributed this observation to the text revisers and their editing choices rather than to the translators.

Furthermore, the Italian researcher Osimo notes that, for Blum-Kulka, this tendency to explicitate is a “spontaneous, irrational, uncontrolled constant”, and concludes in agreement with Pym (1993:123) that explicitation has a cultural origin:

“when you’re crossing a cultural wall, you encounter particular places requiring textual expansion. The most difficult terms tend to require some paraphrase or explanation, usually justifiable as the explicitation of implicit cultural information.”

Vanderauwera (1985), whose experimental studies led to the identification of other strategies, also reports greater explicitness in the translation of Dutch novels. Mainly, she reported the use of interjections, the expansion of condensed passages, the addition of modifiers and qualifiers, repetitions, more accurate descriptions, and the disambiguation of pronouns.

Baker (1992) observed similar patterns in an attempt by the translator to fill a cultural gap. In one specific example, abundant background information was provided to Arab readers to clarify a reference to US president Harry Truman.

Simultaneous interpretation is not foreign to the explicitation strategy either: Shlesinger (1989) notes the use of synonyms or repetitions when dealing with substitutions and ellipses in the source text.

The correlation between explicitness and readability is direct, claims Toury (1995:227), who suggested that further empirical research on the topic be conducted.

Indeed, later corpus-based studies brought additional empirical confirmation. For instance, Munday's (1998) analysis of a novel by García Márquez and its translation into English revealed the existence of shifts of cohesion throughout the translation.

While some authors, such as Barhudàrov (cited in Osimo 2004), classify explicitation in a wider “additions” category, others follow Blum-Kulka and agree that explicitation is in itself an intrinsic element of translation.

An interesting corpus-based study aimed at confirming Blum-Kulka's explicitation hypothesis is Øverås's 1998 work on the English Norwegian Parallel Corpus (ENPC). Both in English and in Norwegian, the observed tendency is towards greater explicitness, though at different levels (Laviosa, 2002:64).

The manifestations of the strategy are essentially either the addition of grammatical or lexical items in the translation, or specification, that is, the expansion or substitution of a grammatical or lexical item, resulting in greater explicitness. Particularly interesting in Øverås's work, beyond its demonstrated rigor and the large variety of factors considered, is the neutralization of language-specific features through the use of a bidirectional parallel corpus.

Other corpus-based work has been carried out by Olohan and Baker (2000) on the omission and inclusion of the reporting “that” in translational and original English. This comparative study, predictably, found higher syntactic explicitness in translated English.


All three strategies (normalization, simplification and explicitation) employ different mechanisms and means of expression. Punctuation rearrangements, and especially sentence splitting, may serve their purpose. As such, this study will focus on occurrences of the phenomenon in a parallel translational corpus.






CAT, TM, Translation Units


As the global electronic evolution takes place, the translation arena is not left aside, and many recent developments reflect an intense effervescence. The industry, and particularly the field of specialized translation, makes regular use of concordancers, terminology tools, Computer-Aided Translation (CAT) tools and other interesting pieces of software on a day-to-day basis. Moreover, new developments and advances are proposed and published continuously.

Yet it is still too often thought, in less familiar circles, that the introduction of computers into the translation process impinges on the free will of the translator and on overall translation quality, as if "computer-aided" systematically stood for automatic translation by the machine itself. This is far from being the case. While the advantages for the translator and the translation are numerous, it is most important to remember that the translator keeps complete control over her work at all times. As Kenny (1999:67) puts it, these tools are merely "an attempt to aid translators in their inherently human endeavour".

Immediate advantages of the use of computers in translation include the recycling of previously translated material. The triple goal of internal (within the same document), horizontal (with other similar documents) and vertical (with previous versions of the same document) consistency is thus achieved. In addition, the time and money savings are important aspects of the question too. Indeed, why pay twice as much when the fruits of a previous effort may be used again? And why work twice as much when you may apply your talent where it is truly needed and recycle previously translated elements?

The concept of CAT tools and translation memory is extremely simple: the translator does not translate the same sentence twice, and is reminded of a previous translation when a potentially similar source is met.

The software is set to parse the text according to user-defined rules: sentence by sentence, paragraph by paragraph, and so on. Under standard conditions, technical translation projects use sentence-based segmentation: the software parses the text until the next hard punctuation mark (period, exclamation mark, question mark, paragraph mark or colon). This defines the portion of source text that is considered at any given time by the CAT tool and hence by the translator. Each translation unit so defined is presented in turn by the software to the translator for translation. Interestingly, the translation unit of the CAT tool is somewhat different from that defined by Vinay and Darbelnet (1958:37), "the smallest segment whose internal cohesion prevents a separate translation of its constituents"; here the choice is rather made on technical or economic grounds. Other times, other realities.
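The sentence-based segmentation just described can be sketched in a few lines. This is only an illustration of the principle; commercial CAT tools apply configurable rules and abbreviation exception lists (so that "Mr." or "e.g." do not end a unit), which this sketch deliberately omits:

```python
import re

# Hard punctuation marks of the default sentence-based segmentation:
# period, exclamation mark, question mark and colon; the paragraph
# mark is represented by a newline here.
UNIT_RE = re.compile(r"[^.!?:\n]+[.!?:]?")

def segment(text: str) -> list[str]:
    """Split a source text into translation units at hard punctuation."""
    return [u.strip() for u in UNIT_RE.findall(text) if u.strip()]
```

Each string returned by `segment` corresponds to one translation unit as the CAT tool would present it to the translator.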

Once the unit is translated, the translation unit and its translation are stored, in perfect alignment, in a database: the translation memory. When a new translation unit is opened and its source text presented for translation, the program checks in the translation memory whether the source unit already exists and, if so, suggests the matching translation to the translator. At this point, the translator may choose to use the suggested translation as is, because it is the best choice in the circumstances; to use the suggestion in an altered form, in order to adapt it to the specific context; or to translate the source unit freely, as seems fit. The unit and its translation are stored in turn in the translation memory for future use.

This process creates two by-products: the translated file and the translation memory. The translated file is immediately usable, like any translated text; the translation memory, on the other hand, will be used in upcoming translations for which the translator may wish to reuse her previous work. Different strategies may be adopted: building a single all-purpose translation memory, or keeping it client-, industry- or even product-specific. Each option has its own advantages and drawbacks, but that discussion is beyond the scope of this document.

The use of CAT tools is therefore very flexible but for one aspect: segmentation. Indeed, the definition of the translation units is set at the beginning of the work. Although it may be changed at any given time, it would become very cumbersome to change it too often within the translation and, in practice, it is not changed. On a variety of grounds, translation projects that use CAT tools often have TM settings imposed by the client very early in the process, so that translators see them as a possibly unfortunate, yet very real, constraint with which they will have to deal in order to complete the project. Parenthetically, for the sake of better understanding the reasons behind the constraints, it should be noted that in many cases the translation agency maintains a master translation memory for each client or product line. It is therefore crucial that such technicalities be observed throughout the life of the translation project in order to allow safe and easy merging of the memories. Hence, the translator performs her task while being silently forced to adopt a predefined segmentation strategy. Of course, this would not be bearable in poetry and other literary contexts, but it seems accepted and quite well enforced in technical translation.

Much beyond our scope, but too insightful to be left aside, Heyn (1998:135) mentions the possibility that translators translate differently when using CAT tools and TMs so as to be able to recycle more easily in future projects.






Research context


This research will explore three translation strategies, namely simplification, explicitation and normalization, through occurrences of sentence splitting in CAT-based translations. As discussed earlier, sentence splitting may be associated with all three universals. The methodology will be presented in the next section; for now, note that the study is corpus-based, more precisely translation-memory-based. In essence, a CAT tool translation memory is a perfectly aligned, bilingual, translational parallel corpus in which each source text is present together with its translation in the target language. Furthermore, the source text is segmented into translation units, and to each source unit corresponds its translated counterpart.

There are of course other ways to produce such aligned corpora. The most obvious, and probably the most labor-intensive, is to manually link source units to their translations in the target texts and somehow keep the linkage live for future use. Alternatively, alignment software is usually available with most commercial and freely distributed CAT tool suites. These bundles contain not only translation tools per se but also a wide variety of other programs, for terminology management, project management or alignment. Typically, such alignment tools take the source and target texts as input and process them automatically to generate the aligned bilingual document. The latter may then be used as such, or more likely exported to a translation memory for further use. Of course, the program's output often needs some manual fine-tuning in order to reflect faithfully the alignment of the source and target texts. Whether manual or automatic, post-translation alignment may turn out to be unexpectedly cumbersome, as the relations between the source and target texts may take the most varied forms. Sentences may be split through translation, in which case one source sentence yields two or more target sentences; some sentences may be combined, so that several sentences are translated into one single sentence. Entire sentences may also be omitted, or added; and finally, the whole text may be reorganized in such a way that the sentence level is no longer appropriate for comparing and aligning the source and target texts.

Lastly, aligned corpora may be produced during the translation itself, by using a CAT tool which automatically generates a translation memory. Such corpora, like those used in this study, are by construction perfectly aligned, since the translation tool imposes the segmentation of the source text and therefore the choice of the translation units. While splitting remains an option, sentence combination is impossible, as is complete reorganization of the text. Omissions and additions are technically possible, though, and it would be interesting to study how widespread they are in the world of technical translation.
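Because the memory keeps each source unit aligned with its translation, candidate split units can be flagged mechanically: a target unit containing more sentence-final marks than its source is a likely split. The heuristic below is a deliberately crude sketch (real counting must cope with abbreviations, ordinals and language-specific punctuation), and the example pair is invented:

```python
import re

# A sentence-final period, exclamation mark or question mark followed
# by whitespace or end of string.
SENTENCE_END = re.compile(r"[.!?](?:\s|$)")

def sentence_count(unit: str) -> int:
    """Number of sentences in a unit; a mark-free unit still counts as one."""
    return max(len(SENTENCE_END.findall(unit)), 1)

def split_units(memory: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return aligned units whose target holds more sentences than the source."""
    return [(s, t) for s, t in memory if sentence_count(t) > sentence_count(s)]
```

Running such a filter over each language pair's memory yields the raw counts from which per-language splitting frequencies can then be compared.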

As Laviosa (1998) claimed, corpus-based studies seem to suggest that translated text is more explicit, more conservative and less lexically dense than comparable original text. All three universals discussed above are clearly perceived in this claim: explicitation, normalization and simplification.

Whichever of these translation universals sentence splitting serves, it induces simpler syntax and hence a lower level of complexity in the target text, which becomes more readable. Our hypothesis establishes a complexity order between the target languages studied such that:

fr > it > de > nl

In other words, following Cosme's conclusions in her study of clause combining (2003), I expect to find, repeatedly in different random samples, the most occurrences of sentence splitting in the Dutch translation and the fewest in the French. Likewise, I assume Italian will rank second and German third on this decreasing scale. It is only fair to note that, although previous work such as Cosme's helped form this hypothesis, it is also greatly affected by personal introspective factors based on my knowledge of and familiarity with these languages.

Interestingly, Baroni and Bernardini observed that relative frequency of punctuation signs is very similar between translations and original texts (2005:18). They comment: "on the one hand, they might suggest that punctuation use (and indirectly sentence length) are less relevant than other textual aspects to the identification of translationese. On the other, they might be taken to suggest that punctuation removal is too rough a way of getting at sentence length, and that other artifices are needed. Or they might simply hint at the fact that the SVMs [Support Vector Machines] are not making use of these features in the first place, which does not imply that they are not relevant to the original/translated distinction. Again, further study is needed in this area. " (ibid.)

Far from destabilizing the basis of this work, this call for further study is precisely what the present study answers. I do not propose that sentence splitting is a translation universal as such, but only one feature of translationese among others.

As a matter of fact, in addition to the methodological reservation Baroni and Bernardini raise with regard to the applicability of Support Vector Machines (the technique they used) to the study of punctuation use, one fundamental aspect of the present work resides in the fact that the corpora studied are bilingual, translational and parallel. I did not treat universals as patterns expressed in arbitrary original and translated texts; rather, my standpoint follows the translation process and the (un)predicted transformations observed in translated texts when set against their sources.

The very principle of studies based on translational parallel corpora is shared by both disciplines: translation studies (TS) and contrastive linguistics (CL). However, as Granger warns, "failing to properly understand the nature of translated texts might lead them to attribute some difference between OL [original language] and TL [target language] to interference from OL when in fact the phenomenon may simply be a manifestation of a translation universal." (2005:22) I shall take this advice seriously and closely monitor the observed occurrences of sentence splitting.

However, when considering a phenomenon such as normalization, Granger's phrase becomes somewhat puzzling. Indeed, the border between language interference and universal becomes blurred: when the OL gives way to a more normative target text, this is most definitely a kind of interference, albeit negative, as well as a recognized universal.

The results of corpus-based studies may, like all empirical data, be misinterpreted in a wide variety of ways depending on the interpretive grid used to analyze them. In particular, an area of research so closely associated with two different disciplines should be handled with the utmost circumspection. Granger points out that the "lack of familiarity with TS findings may lead CL researchers to interpret their data in terms of differences between language systems when they result from translation norms or strategies, while TS researchers may similarly misinterpret their data because of a lack of awareness of a systematic difference between the two language systems established by CL." (2005:26)

Likewise, Santos notes in her PhD research on grammatical translationese (1995) that translationese may stem from the closeness of the languages involved, which may influence the frequency of translation-universal occurrences. In such cases, fine judgement will be required to assess whether an observed occurrence is linked to a translation universal or not.

Furthermore, special care will be necessary in the interpretation of the qualified results as well. Indeed, I expect to find relatively few examples of sentence splitting, for three reasons:

1.    it is a means, not an end: it serves the purposes of translation universals;

2.    the three universals concerned have other means of expression, of which sentence splitting is only one;

3.    a deeply anchored belief in the industry recommends keeping close to the source, and in particular keeping the sentence structure within larger units, avoiding such alterations as splitting or combination. The translator may thus be inclined to refrain from using splitting as an expression of the three translation universals mentioned. This would actually itself be a case of normalization: in order to adhere to the accepted technical norms of the industry, which favor staying close to the source structure, the translator would forgo the linguistic normalization that calls for shorter sentences with less compounded informational content.

Each occurrence of sentence splitting will, however, be all the more meaningful in that CAT tools impose limitations on the translation process, and in particular a pre-defined segmentation.


All the warnings duly set, let us turn to the methodology followed in this study. Most of the studies reported in the literature have been based on literary texts, which, if only by status, account for most of the material of translation studies. I nevertheless deemed it interesting to extend our predecessors' efforts and observations to non-literary texts, for two main reasons. The academic reason is that descriptive translation studies may not ignore other translational material, however humble, such as comic books, advertisements or technical communication. The practical side of the coin is that the parallel translational corpora analyzed here were readily available to me in the form of translation memories from various documentation projects (a software user guide and marketing collaterals).






In order to prepare the corpora needed for this study I used a collection of readily available translation memories. These were donated to science by a translation agency which maintains such memories as a regular practice in its daily business activities.

Each memory contains only one target language and its source; hence I consider each of them a bilingual translational parallel corpus. I analyzed memories in four target languages: Dutch, French, German and Italian. In all cases the source language was English.


Composition of the subcorpora:

1.         Dutch

The translation memory was built on files from a software user manual. It totalled 1409 translation units, which amounted to 29217 source words and 30164 target words.

2.         French

The translation memories for French were based on a software user manual (the same as in Dutch), as well as the marketing-oriented website of another software product and some brochures for this second product. They totalled 1482 translation units, which amounted to 32732 source words and 41022 target words.

3.         German

The German translation memories contained a software user manual (the same as in Dutch), as well as another software product's marketing-oriented website and a series of brochures and marketing collaterals for industrial printers. They totalled 2520 translation units, which amounted to 47006 source words and 45457 target words.

4.         Italian

Lastly, the Italian corpus contained texts from a website, marketing brochures and whitepapers for industrial printers as well as a software user manual (the same as in Dutch). It totalled 1555 translation units, which amounted to 33696 source words and 34882 target words.


This collection of memories additionally defines a parallel multilingual subcorpus. Indeed, the software user manual has been translated into all four languages and is therefore included in each corpus. Although the source text is not exactly identical in each corpus, it is very similar and a large part of it is shared by all the translations. This subcorpus hence allows further comparison between the languages, or at least between their representations in these subcorpora, while the impact of an important variable, namely the intrinsic properties of the source text, is neutralized.

An excerpt of the corpus is found in annex A and shows the translation memory in its raw format prior to being analyzed.


The object of the research, split sentences, is defined by the straightforward shorthand:

1 > 2

In our context, it means that one source sentence has been translated as more than one target sentence, hence two or more.

The identification of split sentences is quite a simple process, essentially thanks to the particular conditions provided by such a perfectly aligned corpus as the translation memory of a standard technical translation project with default CAT tool settings.

As shown in the excerpt in annex A, each unit is presented as a pair of constituents: the source part and the target part, its translation. By definition of the translation unit, the source part contains the source text between the beginning of the unit and the next hard punctuation mark[1]. Although this is not always grammatically accurate, this source part of the translation unit will be referred to as a sentence. Sentence splitting has occurred during translation if a hard punctuation mark is found within the target part of the unit.

Therefore, in order to identify such occurrences of split sentences, I need to search the target parts of the translation memory for punctuation marks that have no direct equivalent in the source.

A simple search query in a word processor program reveals the location of these marks. A manual review is then required in order to assess whether it is a real case of sentence splitting or a false positive, such as, for instance, the periods in an acronym. All the occurrences are collected for future analysis and categorization.
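This search-and-review process can be sketched in a few lines of code. The sketch below is my own illustration, not part of the study's toolchain: it assumes the memory has already been parsed into (source, target) pairs, and the set of hard punctuation marks is a simplifying assumption.

```python
import re

# Hard punctuation marks ending a sentence; the exact set is a judgment
# call (here: period, question mark, exclamation mark followed by a
# space or end of text, so that decimal points are not counted).
HARD_MARKS = re.compile(r"[.!?](?=\s|$)")

def count_hard_marks(text: str) -> int:
    """Count sentence-final punctuation marks in a segment."""
    return len(HARD_MARKS.findall(text))

def split_candidates(units):
    """Yield units whose target has more hard marks than the source.

    Each such unit is only a candidate: a manual review must still
    discard false positives such as periods inside acronyms.
    """
    for source, target in units:
        if count_hard_marks(target) > count_hard_marks(source):
            yield source, target

# Illustrated on the German example from Annex B plus a one-liner unit:
units = [
    ("Rate your photos from one to five stars to easily locate your favorites.",
     "Bewerten Sie Ihre Fotos mit einem bis fünf Sternen. So finden Sie ganz "
     "einfach Ihre Lieblingsfotos wieder."),
    ("CORjet around the world", "CORjet nel mondo"),
]
print(len(list(split_candidates(units))))  # only the German unit is flagged
```

The same check in a word processor amounts to searching for the hard marks in the target column; the script merely automates the first pass before the manual review.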

As rightfully suggested by Baroni (2005:8), the search for occurrences of the pattern was performed on each sub-component of the corpus, so that the findings may be related more easily to their texts of origin and the global picture is not averaged out by locally higher or lower phenomena.


One predictable difficulty concerning the interpretation of the results is due to the very nature of the corpora analyzed. Technical translation, which constitutes our corpora, is a very wide domain, extending from user manuals for various appliances to specification sheets giving numeric details on particular technical aspects of a product. Hence, the corpora by definition contain many units that cannot be subject to the translator's conscious or unconscious application of a specific strategy. All this noise will most certainly have an impact on the observations. An immediate corrective step was to clean the translation memories by manually removing units that contained only numbers, or nominal sentences from technical specifications, of little value to this study. Furthermore, non-text such as formatting indications, font sizes and types was also removed, as it affects the wordcounts and the general readability of the text to be reviewed. It would be interesting to reproduce this experiment on qualified corpora with fewer, if any, polluting translation units such as phone numbers and lists of recommended working temperatures.
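Although the cleaning described above was performed manually, a first pass could be pre-filtered automatically. The heuristic below is my own illustration, not the procedure used in the study: it flags units whose source lacks sentence-final punctuation or is dominated by digits, leaving the final decision to a manual review.

```python
def is_noise_candidate(source: str) -> bool:
    """Flag units unlikely to contain a splittable sentence.

    A unit is suspect if its source does not end in hard punctuation
    (titles, noun phrases) or contains more digits than letters
    (specification lines). Flagged units still need a manual check.
    """
    text = source.strip()
    no_hard_mark = not text.endswith((".", "!", "?"))
    digits = sum(c.isdigit() for c in text)
    letters = sum(c.isalpha() for c in text)
    return no_hard_mark or digits > letters

# Examples taken from the noise list in Annex B:
print(is_noise_candidate("New Product Launch Package"))            # True
print(is_noise_candidate("Courses are live and take place within "
                         "an online class environment."))          # False
```

Such a filter would not catch every nominal sentence (some end in a period), which is precisely why the study kept the cleaning manual.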





Results and discussion

The review of the translation memories presented above and the research for hard punctuation marks in the target texts that do not match a corresponding mark in the source has yielded the results detailed in the table hereafter:

















                        Dutch A  French A  French B  German A  German B  Italian A  Italian B

Translation units          1409      1237       245      2171       349       1095       460

Splitting occurrences        63        33         7        74         7          8         3

Ratio (split/total)        4.5%      2.7%      2.8%      3.4%      2.0%       0.7%      0.6%

Source and target word counts are available per full language memory only (see the composition of the subcorpora above): Dutch 29217/30164, French 32732/41022, German 47006/45457, Italian 33696/34882 (source/target words).

It is important to note that false positives have been discarded and do not appear in these results.

As mentioned earlier, false positives are cases of apparent sentence splitting which are due to other causes and factors completely unrelated to the translation universals at the heart of this study.

However, cases of transformation of hard punctuation marks into other hard punctuation marks, such as a semi-colon becoming a period, are accounted for in these figures. I believe they are a form of normalization of punctuation usage and hence do belong here.


A few examples are presented in Annex B.


By construction, the corpus does not allow the opposite phenomenon to take place, and sentence merging is impossible since the segmentation at the input, during translation, isolates sentences. As mentioned earlier, this rule was in practice not perfectly enforced, yet no occurrences of merging were observed.


Each column contains the results observed in a specific subcorpus of the various corpora studied.

Hence, A and B refer to two different collections of texts. As discussed in the previous section, while collection A may not be exactly the same in each corpus, it is very homogeneous across the languages. Collection B, on the other hand, is unrelated to A and not identical in each language; however, its scope and contents are very similar from corpus to corpus, as the texts belong to the same genre of marketing-oriented brochures.


Out of the 1409 translation units reviewed in the English > Dutch translation memory, which belongs to subcorpus A, 63 occurrences of split sentences were found. In each case, the translator deemed it appropriate to translate one source sentence into more than one target sentence. Likewise, the 1237 French, 2171 German and 1095 Italian translation units contained respectively 33, 74 and 8 occurrences of the phenomenon, all within subcorpus A. To take the full measure of these figures, a useful indicator is the ratio of occurrences to the total number of units reviewed. While Dutch strides ahead with 4.5% of split sentences, German and French total a lower 3.4% and 2.7%, and Italian shows only 0.7% of occurrences.


The results observed in subcorpus B show 7 occurrences in the French translation memory, out of a total of 245 units, 7 split German sentences out of a 349-unit corpus, and only 3 occurrences in Italian out of 460 units. The resulting ratios of occurrences are hence 2.8%, 2% and 0.6% respectively.


This set of results may be analyzed and discussed along two main axes: a vertical one spanning both corpora language by language, and a transversal one which allows comparison of the performance of each language across the two corpora.


The vertical observation axis concerns only three languages, since subcorpus B was not available for Dutch. Both French and Italian show very coherent figures. While French had 2.7% splits in subcorpus A, a very close 2.8% was observed in subcorpus B. Likewise, the Italian results of 0.7% and 0.6% are also very similar. Interestingly, German behaved differently in the two subcorpora, with 3.4% and 2% splits.

A few comments and questions arise from these results. While French and Italian are relatively stable when switching corpora, German shows a different pattern, with markedly more occurrences in one corpus than in the other. The order of magnitude is nevertheless similar, which is comforting in a way, as the phenomenon is observed in both corpora. A variety of reasons may explain the difference between the two observations.

First of all, the source materials differ between the two subcorpora: one is a software manual and the other contains whitepaper texts from a software product website as well as more marketing-oriented materials presenting industrial printing solutions. As a matter of fact, the Italian and French subcorpora B are taken from similar yet different sources, the industrial printers and the whitepapers respectively.

Another possible reason lies in the fact that two different translators intervened in German subcorpus B, while in other languages both subcorpora were translated by the same translator. A closer look at the results for subcorpus B reveals that one of the two translators, who translated subcorpus A, produced significantly more split sentences than the other translator, who translated only part of subcorpus B. Yet, even if we limit subcorpus B to its "same translator" component, the difference between A and B remains. Objective scientific reasoning requires us to further check this issue by reproducing the results with another similar corpus.


On the other hand, the transversal observation shows the differences between the various languages based on the translation of very similar source texts: there is a very large overlap between all the sources of subcorpus A, which all come from the same software user manual, by the same author, and were written and translated during the same 12-month period.

The observed figures seem to indicate that Germanic languages have more split sentences than languages of Latin origin. Analysis of subcorpus A, which is quite homogeneous by constitution, shows that the Italian translation keeps close to the source in terms of sentence segmentation, and splitting is barely found. French, however, shows noticeably more occurrences of the phenomenon, and German and Dutch have many more instances. Indeed, Dutch is at one end of the scale with 4.5%, followed by German with 3.4%, while Italian is at the opposite end with 0.7%, and French, of mixed influence, has 2.7% occurrences.

Of course, one single experiment is far from enough to reach clear-cut conclusions, but it confirms part of my hypothesis as developed earlier. Indeed, on an increasing simplicity scale, Italian comes first, followed by French, German and Dutch. My hypothesis stated that French would turn out to be more complex than Italian, as far as sentence splitting goes.

These results validate only part of the hypothesis as Dutch is indeed simpler than German, with more split sentences, which is in turn simpler than French and Italian. However, the relative complexity of French and Italian seems to be opposite to what I claimed.

Further work is required in order to validate these findings and their reproducibility, as well as to understand the translationese goal achieved by the means of sentence splitting, whether it is none, one, two or all three of the translation universals, explicitation, normalization and simplification.


It is important to note that, in some cases, the translation memories used contained several instances of the same translation unit. Although this may not influence the results in their percentage form, it certainly introduces noise that could be neutralized (see examples in Annex B). In addition, I noted during the analysis of the translation memories that the source materials still contained, in spite of our cleaning efforts, many instances of noun phrases, titles and single-proposition phrases. These obviously cannot be the source of a sentence split. This necessarily impacts the results and artificially pulls the proportion of split sentences towards the lower end. A better sense of the power of the strategies would be achieved if objectively "unsplittable" source units were not part of the corpus.

Further corpus-based studies using translation memories as source material would especially benefit from cleaning such cases of duplicate (or multiple) occurrences and simple sentences out of the memories. We would thus come closer to the true potential for realization of sentence splitting, and open the way to a better understanding of the translation universals it serves.







Many recent works have focused on the existence of unique features in translational material. In light of Walter Benjamin's intuitions, this is not even surprising, since translation is an independent mode, a genre in itself.

This study, with the dimensions of a pilot, shows clear evidence for the existence of a splitting pattern in translations. It is remarkable that different samples yield similar results, thereby confirming the consistency of the behavior. Furthermore, it is possible that the family the target language belongs to influences the intensity observed, with Germanic languages showing more occurrences than languages of Latin origin. This is of course a very preliminary conclusion, merely a hypothesis for further research.

Although only one sample was used for Dutch, we can assume with some degree of confidence that additional samples would have been consistent too, since in the other languages additional samples showed consistent observations.

I believe, in line with our predecessors' works, that sentence splitting in particular, and punctuation changes in general, are means deployed to serve translation strategies, rather than universals per se. In other words, they are manifestations of the universals.

Further study is needed in order to prove these findings reproducible. It would be particularly interesting to perform such work on fully qualified corpora containing only texts with a potential for sentence splitting: sentences with more than one proposition, and no bare noun phrases. The latter might turn out to be a challenge in itself, since current technical writing trends recommend the use of Simplified English, so that complex grammatical structures are already avoided at the authoring stage. However, a more controlled corpus with only one sentence per unit would, in addition, allow easy access to information that was hidden from us in this study, such as sentence length. Interesting patterns could be revealed by such additional details.

Last but not least, invaluable complementary information would be derived from a study based on a monolingual comparable corpus of original and translated texts. According to newer trends of investigation in corpus research (Puurtinen, 2003), much light is to be cast by such study on the understanding of the translationese phenomenon and its manifestations.

Corpus-based research on universals brings answers and opens new horizons towards more complete and advanced descriptive translation studies, and towards a better understanding of the mutual influence of language and culture, as Whorf suggested long ago.


Annex A

Excerpt of translation memory.



<CrD>16082004, 19:24:03


<Seg L=EN-GB>CORjet around the world

<Seg L=IT-IT>CORjet nel mondo



<CrD>16082004, 19:28:13


<ChD>16082004, 19:49:56


<Seg L=EN-GB>The Scitex Vision CORjet high-speed inkjet press is changing packaging and POP printers\rquote businesses around the world.

<Seg L=IT-IT>La macchina da stampa inkjet ad alta velocità CORjet di Scitex Vision sta dando una svolta al lavoro dei tipografi di POP e confezioni di tutto il mondo.



<CrD>16082004, 19:18:00


<ChD>16082004, 19:47:06


<Seg L=EN-GB>ASAP, a Minnesota supplier of printing-related services for the marketing industry is using its Scitex Vision VEEjet digital flatbed inkjet printer to take its business in new directions.

<Seg L=IT-IT>ASAP, un fornitore del Minnesota di servizi correlati alla stampa rivolti al settore del marketing, utilizza la stampante inkjet flatbed digitale VEEjet di Scitex Vision per espandere il proprio business.
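The raw format above can be parsed into source/target pairs with a short script. The sketch below is my own illustration, assuming only what the excerpt shows: segment lines of the form `<Seg L=XX-XX>`, with metadata lines such as `<CrD>` and `<ChD>` (creation and change dates) simply skipped.

```python
import re

# Matches lines like "<Seg L=EN-GB>CORjet around the world".
SEG = re.compile(r"<Seg L=([A-Z]{2}-[A-Z]{2})>(.*)")

def parse_memory(lines, source_lang="EN-GB"):
    """Pair each source segment with the target segment that follows it.

    Lines that are not segments (dates, other metadata) are ignored.
    Returns a list of (source, target) tuples.
    """
    units, source = [], None
    for line in lines:
        m = SEG.match(line.strip())
        if not m:
            continue
        lang, text = m.groups()
        if lang == source_lang:
            source = text
        elif source is not None:
            units.append((source, text))
            source = None
    return units

raw = [
    "<CrD>16082004, 19:24:03",
    "<Seg L=EN-GB>CORjet around the world",
    "<Seg L=IT-IT>CORjet nel mondo",
]
print(parse_memory(raw))  # [('CORjet around the world', 'CORjet nel mondo')]
```

Once the memory is in this (source, target) form, the punctuation search described in the Methodology section can be run over it directly.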




Annex B


Examples of sentence splitting:


English source - Rate your photos from one to five stars to easily locate your favorites.

German target - Bewerten Sie Ihre Fotos mit einem bis fünf Sternen. So finden Sie ganz einfach Ihre Lieblingsfotos wieder.


English source - Courses are live and take place within an online class environment, where students communicate and interact with the instructors in real-time.

French target - Les cours ont lieu en direct dans un environnement de classe en ligne. Les étudiants peuvent ainsi communiquer et interagir avec les instructeurs en temps réel.


English source - Scitex Vision TURBOjet is designed to meet the high quality, high-speed and low cost printing demands of the screen-printing and offset industry, this system adjusted to the printing market growing demands, especially in regard to image quality, color range and printing cost.

Italian target - La Scitex Vision TURBOjet è progettata per soddisfare le esigenze di stampa di elevata qualità, alta velocità e costi ridotti nel settore della stampa serigrafica e offset. Questo sistema si è adeguato alle crescenti richieste del mercato della stampa, specialmente in relazione alla qualità dell’immagine, alla gamma di colori e ai costi di stampa.


With punctuation change (normalization):

English source - Select a layout from the layout list; you can see a preview of your selection on the right side of the screen.

Dutch target - Selecteer een lay-out uit de lay-outlijst. U kunt een afdrukvoorbeeld van uw selectie rechts in het scherm zien.


Noun phrases and other units that cause noise:

Error in getting your albums from dotPhoto.

New Product Launch Package

Your expert partner for consumables

25 KHz, 600 dpi, 20-30 PL, 5.9x0.8 inches, Viscosity of 5-20 centipoises




Baker, M. (1992). In other words: A coursebook on translation. London and New York, Routledge.


Baker, M. (1993). Corpus Linguistics and Translation Studies. Implications and Applications. In M. Baker, G. Francis, and E. Tognini-Bonelli (eds) Text and Technology, pp. 233-250. Amsterdam & Philadelphia: Benjamins.


Baker, M. (1995). Corpora in Translation Studies: An Overview and Some Suggestions for Future Research. Target 7(2): pp. 223-243.


Baker, M. (1996). Corpus-based Translation Studies: The Challenges that Lie Ahead. In H. Somers (ed.) Terminology, LSP and Translation Studies in Language Engineering: in honour of Juan C. Sager. p. 183. Amsterdam and Philadelphia. John Benjamins.


Baker, M. and Olohan M. (2000). Reporting that in Translated English: Evidence for Subconscious Processes of Explicitation?, Across Languages and Cultures 1(2): pp. 141-158.


Baroni M. and Bernardini S. (2005). A New Approach to the Study of Translationese: Machine-Learning the Difference between Original and Translated Text. Literary and Linguistic Computing. Oxford University Press.


Blum-Kulka S. and Levenston E. (1983). Universals of lexical simplification. In C. Faerch and G. Kasper (eds) Strategies in interlanguage communication. pp. 119-139. London. Longman.


Blum-Kulka S. (1986). Shifts in cohesion and coherence in translation. In: House J. and Blum-Kulka S. (eds.), Interlingual and Intercultural Communication. pp. 17-37. Tubingen: Narr.


Cosme C. (2003). A Corpus-based contrastive study of clause combining in English, French and Dutch. BAAHE 2003


Even Zohar I. (1990). The position of translated literature within the literary polysystem, Polysystem Studies. Poetics Today 11(1). pp. 45-51.


Frawley W. (1984). Prolegomenon to a Theory of Translation. In W. Frawley ed. Translation: Literary, Linguistic and Philosophical Perspectives. p. 168. London: Associated University Press.


Gellerstam M. (1986). Translationese in Swedish novels translated from English. In Wollin & Landquist Eds. Translation Studies in Scandinavia, Proceedings from the Scandinavian Symposium on Translation theory.


Granger S. (2003). The corpus approach: a common way forward for Contrastive Linguistics and Translation Studies. In Granger S., Lerot J. and Petch-Tyson S. (eds) Corpus-based Approaches to Contrastive Linguistics and Translation Studies. pp. 17-29. Amsterdam & Atlanta: Rodopi.

Heyn M. (1998). Translation Memories: Insights and proposals. In Bowker et al. Unity in Diversity? Current Trends in Translation Studies. pp. 123-136. Manchester: St. Jerome Publishing.


Kenny D. (1999). CAT Tools in an Academic Environment: What are they good for? Target 11(1). pp. 65-82


Laviosa S. (1998). Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose. Meta 43(4). pp. 557-570.


Laviosa S. (2002). Corpus-based Translation Studies: Theory, Findings, Applications. pp. 33-87. Amsterdam & New York.


Laviosa S. (2003). Corpora and the Translator. In Somers (ed) Computers and Translation: A Translator’s Guide. Chap. 7. Amsterdam & Philadelphia. Benjamins Publishing Company.


May R. (1997) Sensible Elocution. How Translation Works in & upon Punctuation, The Translator 3(1). pp. 1-20


Munday J. (1998). A Computer-Assisted Approach to the Analysis of Translation Shifts. Meta 43(4). pp. 542-556.


Nida E. (1959). Bible translating. In Brower, R.A. ed. On translation. pp. 11-31. Harvard: Harvard University Press.


Ossimo B. (2004). Compensation and Explicitation. Translation Course, available online at


Puurtinen T. (2003). Genre-specific features of translationese? Linguistic differences between translated and non-translated Finnish children’s literature. Literary and Linguistic Computing 18(4). pp. 389–406.


Pym A. (1993). Epistemological problems in translation and its teaching. A seminar for thinking students. Calaceit (Teruel), Caminade. p123


Pym A. (2005). Explaining Explicitation, draft version of a paper to be published in Krisztina Károly ed. New Trends in Translation Studies. In Honour of Kinga Klaudy. Budapest, 2005.


Santos D. (1995). On grammatical translationese. In Koskenniemi, Kimmo (comp.), Short papers presented at the Tenth Scandinavian Conference on Computational Linguistics (Helsinki, 29-30th May 1995). pp. 59-66. Helsinki.


Scott N. (1996). Investigating normalization in literary translation, paper presented at the “Looking at language into the millennium” seminar, Dept of English Language, University of Glasgow.


Séguinot C. (1988). Pragmatics and the Explicitation Hypothesis. TTR: Traduction, Terminologie, Rédaction 1(2). pp. 106-114.


Shlesinger M. (1989). Simultaneous Interpretation as a factor in effecting shifts in the position of texts on the oral-literate continuum. MA Thesis, Tel Aviv, TAU.


Shlesinger M. (1995). Shifts in Cohesion in Simultaneous Interpreting. The Translator 1(2) pp. 193-214.


Tirkkonen-Condit S. (2000). In search of translation universals: Non-equivalence or 'unique' items in a corpus text. Paper presented at Research Models in Translation Studies, UMIST and UCL, Manchester.


Tirkkonen-Condit S. (2002). Translationese – a myth or an empirical fact? A study into the linguistic identifiability of translated language. Target 14(2). pp. 207-219.


Toury G. (1991). What are descriptive studies into translation likely to yield apart from isolated descriptions. In K.M. Van Leuwen-Zwart and T. Naaijkens (eds) Translation Studies: The state of the art. Amsterdam, Rodopi.


Toury G. (1995). Descriptive Translation Studies and Beyond. pp. 206-268. Benjamins, Amsterdam & Philadelphia.


Vanderauwera R. (1985). Dutch novels translated into English. The transformation of a "Minority" Literature. Amsterdam, Rodopi.


Vinay J.-P. & Darbelnet J. (1958). Stylistique comparée du français et de l'anglais. Méthode de traduction. pp. 8-37. Paris, Didier.

[1] The segmentation required by the CAT tool imposed that the source part of a translation unit conform to this definition. In practice, however, it turned out that in places some translation units contained more than one sentence in their source part. This does not impair the validity of the study, but it has to be reported for the sake of transparency and truthfulness.