Presentation by Mr. Nabil Ali
MACHINE TRANSLATION: A CONTRASTIVE LINGUISTIC PERSPECTIVE
NABIL ALI
I Background
In 1983, the author was assigned the task of developing the first bilingual Arabic-English home computer. This task involved the development of a bilingual operating system (MSX-based), as well as the establishment of a software development unit dedicated to Arabic applications. Due to the home orientation of the project, the emphasis was mainly on culture-ware and education-ware. This placed a higher priority on natural language processing and on solving the myriad of problems associated with Arabic computation. At that time, Arabic was extremely underprivileged in the field of computing, suffering the limitations of a minimal system at a pure character level and poor printing and display qualities. Thus, it was necessary to shift to a more developed level, dealing with larger linguistic units, namely the word, the sentence, and the continuous text. Since English was the most established example of language computation, we had to draw on its resources and techniques. We were soon to discover, however, that these techniques are not suitable for Arabic. This is
simply due to the fact that Arabic, compared to English, is much more complex at almost all linguistic
levels (with phonology as the sole exception). At the character level, the complexity lies in the cursive shape and concatenation of Arabic letters; above all, these letters are characterized by a high degree of context sensitivity, meaning that the appropriate shape of a letter is determined by the surrounding letters. Hence, the author had to implement an automaton for the generation of the appropriate shape of letters depending on their context. At the word level, the morphology of Arabic (the language subsystem dealing with the structure of words) is, as is already known, the most sophisticated of all languages worldwide. Therefore, the author had to develop a morphological
processor capable of analyzing any Arabic word into its morphological primitives, as well as
synthesizing the final form of words out of these primitives. Lastly came the syntactical level, which no
doubt proved to be the most difficult, primarily because Arabic is usually written without vowels. In
essence, written Arabic is a quasi-stenographic script, and this results in a severe mélange of various
ambiguities, unprecedented in, and largely absent from, other languages. The morphological ambiguity, due to the absence of vowels, is intermixed with other types of ambiguity, mainly those associated with word sense, part of speech, and syntactical structure. For a non-Arabic speaker to appreciate such a problem, let us assume hypothetically that an English sentence such as "SOME FIRMS LEND MONEY" is written in the Arabic fashion. The result will be the following string of consonants: "SM FRMS LND MNY". Each of these consonantal forms may have a set of alternative vowelized interpretations (Figure 1).
Figure 1: Morphological ambiguity due to the dropping of vowels
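
As a rough illustration of the combinatorial nature of this ambiguity, the following sketch (using the English analogy above, with a purely hypothetical mini-lexicon rather than the author's actual data) enumerates the candidate vowelized readings of such a consonantal skeleton:

    # Minimal sketch (not the author's system): enumerating vowelized readings of
    # a consonantal skeleton. The tiny "lexicon" below is purely illustrative.
    from itertools import product

    # Hypothetical mapping from consonantal forms to plausible vowelized readings.
    READINGS = {
        "SM":   ["SOME", "SUM", "SAM", "SEEM"],
        "FRMS": ["FIRMS", "FRAMES", "FORMS", "FARMS"],
        "LND":  ["LEND", "LAND", "LINED"],
        "MNY":  ["MONEY", "MANY"],
    }

    def vowelized_candidates(skeleton: str) -> list[str]:
        """Return every combination of vowelized readings for a consonantal string."""
        words = skeleton.split()
        alternatives = [READINGS.get(w, [w]) for w in words]
        return [" ".join(choice) for choice in product(*alternatives)]

    candidates = vowelized_candidates("SM FRMS LND MNY")
    print(len(candidates))   # 4 * 4 * 3 * 2 = 96 candidate sentences
    print(candidates[:3])

Even with this toy lexicon, a four-word sentence yields 96 candidate readings, which is why disambiguation cannot remain at the word level alone.
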
Thus, any syntactical processor dealing with Arabic text as its input must first disambiguate such a quasi-stenographic script. As a result, an automatic vowelizer became mandatory as a prerequisite for
Arabic computation. To solve this problem, the author has developed a "shallow" automatic
understander of Arabic in order to disambiguate the unvowelized text, as well as to restore the
missing vowels. This required the achievement of three main computational linguistic tasks: (1) the
development of an Arabic parser; (2) the development of a lexical-semantic processor; (3) the
development of an automatic generator of the vowelized text.
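
A minimal sketch of how these three tasks might be chained is given below; the function names and data structures are illustrative stand-ins only, not the author's actual modules:

    # Hedged sketch of the three-stage pipeline described above. All names and
    # data shapes are hypothetical assumptions, not the author's actual system.
    from dataclasses import dataclass

    @dataclass
    class Analysis:
        tokens: list[str]          # unvowelized surface tokens
        parse: list[str]           # chosen part-of-speech / constituent labels
        senses: list[str]          # chosen word senses after disambiguation

    def parse_arabic(tokens: list[str]) -> list[str]:
        """(1) Arabic parser: assign a syntactic label per token (stub)."""
        return ["NOUN" for _ in tokens]

    def disambiguate(tokens: list[str], parse: list[str]) -> list[str]:
        """(2) Lexical-semantic processor: pick one sense per token (stub)."""
        return [f"{t}#sense1" for t in tokens]

    def vowelize(analysis: Analysis) -> list[str]:
        """(3) Generator: emit vowelized forms consistent with the analysis (stub)."""
        return [t + "-a" for t in analysis.tokens]   # placeholder diacritization

    def shallow_understander(text: str) -> list[str]:
        tokens = text.split()
        parse = parse_arabic(tokens)
        senses = disambiguate(tokens, parse)
        return vowelize(Analysis(tokens, parse, senses))

    print(shallow_understander("ktb alwld"))   # toy transliteration, illustrative only
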
Since parsing techniques developed for English have proven inadequate for the Arabic language, both in function and performance, a parsing system based on a multi-level grammar was
developed and implemented. This system is capable of handling the previously mentioned intermixed
set of ambiguities. The disambiguation mechanism works incrementally at every level of the grammar.
Residual ambiguities are resolved heuristically, resorting to preferential principles working on both
syntactic and semantic levels.
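
The following sketch illustrates the general idea of such incremental, preference-based disambiguation; the level filters and scores are invented for illustration and do not reflect the actual grammar:

    # Illustrative sketch only: incremental pruning of candidate analyses level by
    # level, with residual ambiguity resolved by a heuristic preference score.
    def disambiguate_incrementally(candidates, levels, preference):
        """Filter candidates at each grammar level; break ties heuristically."""
        for level_filter in levels:
            surviving = [c for c in candidates if level_filter(c)]
            if surviving:                       # never discard everything
                candidates = surviving
        # residual ambiguity: rank by a syntactic/semantic preference score
        return max(candidates, key=preference)

    # Toy candidates: (analysis, morphologically_valid, syntactically_valid, score)
    candidates = [
        ("reading-A", True, True, 0.9),
        ("reading-B", True, False, 0.7),
        ("reading-C", True, True, 0.4),
    ]
    levels = [lambda c: c[1], lambda c: c[2]]          # morphology, then syntax
    best = disambiguate_incrementally(candidates, levels, preference=lambda c: c[3])
    print(best[0])   # "reading-A"
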
The availability of this sophisticated system suggested the idea of using it as a generalized model for other languages. The system was successfully slimmed down to handle English. For example, the morphological processor was slimmed down to be used as an English stemmer (to extract the stem of inflected word forms), the shallow understander was slimmed down to an efficient English parser, and the lexical-semantic disambiguator was tailored to handle the types of ambiguity encountered in English.
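
By way of illustration, a much-simplified suffix-stripping stemmer of the kind alluded to might look as follows; the suffix list and the minimum-length rule are assumptions, far cruder than either the actual component or a full stemmer such as Porter's:

    # A minimal suffix-stripping English stemmer (illustrative sketch only).
    SUFFIXES = ["ing", "edly", "ed", "es", "s", "ly"]

    def stem(word: str) -> str:
        """Strip the longest matching suffix, keeping at least three characters."""
        word = word.lower()
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    print([stem(w) for w in ["firms", "lending", "translated", "clearly"]])
    # ['firm', 'lend', 'translat', 'clear']
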
In general, the developed Arabic-English bi-directional system could be characterized as being engineering-oriented. The author felt that its rather ad hoc approach had to be refined theoretically through more serious investigation in the field of contrastive linguistics. In this regard, the author has intensively exploited a basic property of Arabic, namely its non-exotic character, which places it in an intermediate position among the linguistic extremes found in other languages.
This paper is intended to present machine translation from a linguistic-divergence perspective. First, multi-linguality will be overviewed contrastively as a preface to a more specific discussion focusing on translation. The paper will conclude with a brief description of the contrastive aspects of the bi-directional Arabic-English translation system which is currently being developed.
II Multi-linguality: A Contrastive Perspective
Language in the present study is viewed as a system composed of two main components, grammar and lexicon. Each of the two components will be considered in turn.
According to Robert Freidin (1), the comparative work carried out by nineteenth-century grammarians was concerned with establishing an explanatory basis for the relationships between languages and groups of languages, primarily in terms of a common ancestor. Contemporary comparative grammar, in contrast, is significantly broader in scope. It is concerned with a theory of grammar that is postulated to be an innate component of the human mind/brain. In this way, the theory of grammar is a theory of
human language and hence establishes the relationship among all languages, not just those that
happen to be related by historical accident (for instance, via common ancestry).
One can safely say that the current advancements in contrastive linguistics are attributed to the
adoption of the generative paradigm. This paradigm has been applied successfully across the different
language subsystems, mainly morphology, syntax, and semantics. A general theory of generative
morphology, both derivational and inflectional, has been developed (2). It has reached a level of
maturity that has made it possible to be applied cross-linguistically. This led to different attempts to
develop a universal computational morphological system. These universal approaches deal basically
with affixational morphology. Their performance with regard to diffusional (non-concatenative)
morphology is still questionable. Furthermore, the majority of morphological research focuses on the word form much more than on the word's semantics and information content. Aronoff has
initiated a semantic-based morphology which views the derivation of words as a generative process
from one word sense to another. This shift toward semantic-based morphology is essential for
contrastive analysis at the word level in general, and for translation, in particular.
At the syntactic level, the treatment of the diversity of languages is based on the "Government and Binding" (GB) theory developed by Chomsky. According to GB, a language is not a system of rules, but a system of specifications for parameters in an invariant system of principles of Universal Grammar (UG). Linguistic diversity can be explained as variation in the setting of certain values for the principles of UG. Hence, linguistic variation would, in part, be reduced to parametric variation over the principles of UG. An example of such a parameter is the "null-subject" parameter, which differentiates between languages requiring explicit subjects, such as French and English, and those that permit the omission of the subject, such as Arabic and Spanish. While the GB parameterization-based model is theoretically apt, due to its general nature it is too abstract to be applied directly as a practical computational system for analysis (parsing) and synthesis (generation).
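
As a schematic illustration (not a GB formalism), the null-subject parameter can be expressed as a simple per-language setting consulted during generation; the representation below is an assumption introduced purely for exposition:

    # A hedged illustration of parametric variation: the null-subject parameter
    # as a per-language setting. Values reflect the examples given in the text.
    UG_PARAMETERS = {
        "English": {"null_subject": False},
        "French":  {"null_subject": False},
        "Arabic":  {"null_subject": True},
        "Spanish": {"null_subject": True},
    }

    def realize_subject(language: str, pronoun: str) -> str:
        """Drop the pronominal subject if the language licenses null subjects."""
        if UG_PARAMETERS[language]["null_subject"]:
            return ""          # subject recoverable from verb inflection
        return pronoun

    # "(He) writes": English keeps the pronoun, Arabic may drop it.
    print(repr(realize_subject("English", "he")))   # 'he'
    print(repr(realize_subject("Arabic", "he")))    # ''
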
As we move towards semantics, linguistic divergence fades out. Although languages usually exhibit broad disparity at the morphological and syntactical levels, such disparity is greatly diminished at the semantic level, at which the various syntactical forms are converted to their corresponding logical forms. Formal logic is basically universal and thus can transcend linguistic boundaries. At this logical level, the sameness of meaning is explicitly expressed; this, in turn, enables cross-linguistic mapping and transformation. Engineering-oriented approaches have been developed towards a universal semantic processor which can work multi-lingually. They adopt a compositional paradigm which decomposes meaning using a universal set of semantic primitives. A well-known example of these processors is the conceptual-dependency model developed by Schank.
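
The following toy sketch, loosely in the spirit of Schank's conceptual-dependency primitives (e.g., ATRANS for an abstract transfer of possession), illustrates how predicates from different languages can be mapped onto the same language-neutral structure; the tiny verb table and transliterations are illustrative assumptions:

    # Rough sketch of compositional decomposition into language-neutral primitives.
    from dataclasses import dataclass

    @dataclass
    class Concept:
        primitive: str      # language-neutral primitive act, e.g. "ATRANS"
        agent: str
        object: str
        recipient: str

    # Verbs from two languages mapping onto the same primitive structure.
    VERB_TABLE = {
        ("en", "lend"):   "ATRANS",
        ("ar", "yuqrid"): "ATRANS",   # transliterated Arabic "to lend" (illustrative)
    }

    def to_logical_form(lang: str, verb: str, agent: str, obj: str, recipient: str) -> Concept:
        """Map a surface predicate onto its universal conceptual structure."""
        return Concept(VERB_TABLE[(lang, verb)], agent, obj, recipient)

    en = to_logical_form("en", "lend", "firm", "money", "client")
    ar = to_logical_form("ar", "yuqrid", "firm", "money", "client")
    print(en == ar)   # True: the same meaning, expressible in either language
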
Regarding the lexicon, and according to James Pustejovsky (3), "Computational and theoretical linguistics have largely treated the lexicon as a static set of word senses tagged with features for syntactic, morphological, and semantic information." This view has undermined the role of the lexicon within the overall language system. Indeed, the pendulum has since swung to the other extreme of viewing the whole language within the lexicon itself. The notion that morphology, syntax, and semantics reside in the lexicon is gaining acceptance. Under this lexicon-based framework, the major role of these language subsystems is to express formally (rule-based) the lexical redundancies and regularities encountered among the different lexical entries. Currently, the generative paradigm is being introduced in the field of the lexicon in an attempt to upgrade the art of lexicography to the level of an exact science, now termed "lexicology". MIT has launched a cross-linguistic lexicon project with the main objective of systematizing different aspects of lexical divergence.
A major challenge facing generative lexicology is that related to metaphors. Since metaphors are
strongly linked to culture, it is difficult to isolate the culture-dependent subparts from those that are
independent. However, Lakoff, in his seminal work Metaphors We Live By, sheds light on how to
tackle the metaphor problem cross-linguistically. By providing us with many examples of commonly
used metaphors that exist in many different languages, he initiates a new approach to perceive
metaphors at a higher universal level.
III Application to the Field of Machine Translation
With the above overview of multi-linguality, we now bring forth our contrastive analysis in the context
of machine translation. This transition could be better visualized if we note that the changing attitude
towards translation studies has always been determined by transformations in the theory of language.
According to Marie-Thérèse Abdel-Messih (4), one could summarize these transformations in terms of
the following milestones:
" Plato s theory of language: which assumes the existence of a fundamental prior utterance with a
determinant literal meaning that can be transformed inter-lingually.
1. Defining translation in terms of sameness: a discourse that privileged the source
language, thus making the corresponding target language as a by-product of the
original.
1. The rejection of the sameness approach a trend which gave way to plural
interpretations that de-stabilized the concept of fidelity. Languages, according to
Benjamin, are not strangers to one another. Rather, they are interrelated in what they
want to express. However, this kinship of languages does not refer to superficial
similarities. Instead, it manifests itself in deeply seated properties which can only be
surfaced through in-depth contrastive investigation.
1. The rejection of the concept of universal language that could be reached through an
accurate translation. Jacques Derrida re-interpreted the art of reading in terms of
growth of language whereby translation supplements one language with what the
other lacks.
In the field of machine translation, "the kinship of languages" has different realizations. We can distinguish three kinds of multi-lingual system that can translate from one set of languages to another. These are the inter-lingua translation model, the parameterization model, and the isomorphic grammar model (Figure 2). Here, the former two will be discussed, while the latter will be dealt with separately in the coming section.
Figure 2: Machine translation models
(a) The Inter-Lingua Model: To recognize the significance of the inter-lingua machine translation model, one may refer to the much more simplistic transfer scheme used to translate mono-directionally from one language to another. Generally speaking, this scheme abides by the previously mentioned "sameness" principle, whereby the generated translated text is defined by the grammar of the source language and the transfer subsystem which converts it to the corresponding target language representation. For each language pair, there is a specific transfer module. This is definitely a tough requirement, which makes the transfer scheme extremely impractical in the case of multilingual bi-directional automatic translators. The inter-lingua model came as a solution to this problem. In this model, a grammar of the source language defines an analysis component which translates from the source language into an intermediate language known as the inter-lingua. A grammar of the target language defines a generation component which translates from the inter-lingua to the target language (6). By this, instead of using a different transfer module for each language pair, the inter-lingua works as a multi-lingual transfer module. A major drawback of the inter-lingua model is that it still requires specific analysis and synthesis components for each member language.
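
The practical difference can be made concrete by counting components: a pairwise transfer scheme needs a module for every ordered language pair, whereas the inter-lingua scheme needs only one analysis and one generation component per language. The following small sketch shows the growth of the two figures:

    # Sketch contrasting the two architectures discussed above.
    def transfer_modules(n_languages: int) -> int:
        """Ordered language pairs, one transfer module each."""
        return n_languages * (n_languages - 1)

    def interlingua_components(n_languages: int) -> int:
        """One analysis plus one generation component per language."""
        return 2 * n_languages

    for n in (2, 5, 10):
        print(n, transfer_modules(n), interlingua_components(n))
    # 2   2  4
    # 5  20 10
    # 10 90 20

For two languages the transfer scheme is actually cheaper, but the inter-lingua scheme wins rapidly as the number of languages grows.
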
(b) The Parameterization Model: The parameterization model is an enhancement of the inter-lingua model that overcomes the major drawback just mentioned. The approach to translation is similar to the traditional approach in that it uses a language-independent representation. Nevertheless, in this approach there is only one analysis component (a multi-lingual parser) and one synthesis component, both of which work multi-lingually. Language-specific information is factored out in terms of parameter settings, and the translation mapping operates uniformly across all languages. The UNITRAN system developed by Dorr is one implementation of this model. It uses parameterization to capture both syntactic and lexical distinctions (5). The parameter-setting approach is desirable from a number of different perspectives. First, it allows language-specific knowledge to be represented separately from the language-independent information embodied in the syntactical principles and the inter-lingual representation. Second, it accommodates the contrastive aspects of the languages involved. Third, the parameter-setting approach allows a machine-translation system to be easily modified and augmented. As previously mentioned, however, the parameterization-based machine translation model is still rather abstract, and this indeed renders its implementation a difficult task.
IV Isomorphic Grammar Approach
In developing his bi-directional Arabic-English machine translation system, the author has adopted the isomorphic grammar approach, whereby the grammars of the source and target languages are attuned to one another. This differs from the inter-lingua models, in which grammars for different languages can be developed independently. Isomorphism has been established through different lexical and grammatical development arrangements (a schematic sketch of the resulting rule alignment follows the list below). These are:
1. The same grammar model is used for the language pair. In our system, the Generalized Phrase Structure Grammar (GPSG) developed by Gazdar has been chosen.
2. Common guidelines for grammar writing with regard to categorization and subcategorization, as well as the sequence of syntactical constructions and derivations.
3. The same lexicalization principles, which guarantee the compatibility of the detailed parts of speech between the source language and the target language.
4. An identical verbal argument scheme (semantic valence patterns).
5. The same feature system to support the semantic compositional analysis of meaning and the specification of morpho-syntactical properties.
6. The same notational system for both grammar and lexicon.
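
The sketch below gives a toy picture of such isomorphism: rule pairs written against the same category inventory and sharing identifiers, so that transfer reduces to a direct mapping between counterparts. The rules shown are simplified assumptions, not the system's actual GPSG grammars:

    # Toy illustration of isomorphic rule pairs keyed by shared identifiers.
    ENGLISH_RULES = {
        "NP1":  ("NP", ["Det", "Adj", "N"]),   # English: adjective precedes its noun
        "GEN1": ("NP", ["NP_poss", "N"]),      # English genitive: X's Y
    }
    ARABIC_RULES = {
        "NP1":  ("NP", ["Det", "N", "Adj"]),   # Arabic: adjective follows its noun
        "GEN1": ("NP", ["N", "NP_poss"]),      # Arabic idafa: head noun first
    }

    def transfer_rule(rule_id: str) -> tuple:
        """Map an English rule onto its isomorphic Arabic counterpart by identifier."""
        assert rule_id in ENGLISH_RULES and rule_id in ARABIC_RULES
        return ENGLISH_RULES[rule_id], ARABIC_RULES[rule_id]

    print(transfer_rule("NP1"))
    # (('NP', ['Det', 'Adj', 'N']), ('NP', ['Det', 'N', 'Adj']))
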
The bidirectionality of the Arabic-English translation system has rendered its syntactical and lexical transfer components much more sophisticated than those usually found in mono-directional machine translation systems. The main reason is that both members of the language pair can act as a source language; hence, the translation system does not have the luxury of subsetting the linguistic coverage, either lexically or syntactically.
Writing isomorphic grammars for different languages, particularly those that are members of different language families, as is the case for Arabic and English (Arabic belonging to the Semitic family and English to the Indo-European), is considered difficult. However, through the author's experience in writing grammars for both languages, the task has proved less difficult than expected. Isomorphism emerges naturally, provided that the style of grammar writing abides by a set of well-established principles. Needless to say, perfect isomorphism cannot be attained, due to genuine differences between the languages. Regarding the Arabic-English language pair, the major differences could be summarized as follows (a small mapping sketch is given after the list):
1. Word formation: English word formation is mainly of an affixational nature; Arabic combines both affixational and diffusional (non-concatenative) methods of word formation.
2. Difference in basic word order: while English is categorized as SVO (S: subject, V: verb, O: object), Arabic is basically VSO. Arabic also allows for nominal sentential patterns initialized by the subject.
3. Null-subject parameter: English requires an explicit subject, which can be dropped in Arabic.
4. Syntactical flexibility: English syntax follows a strict word order; Arabic syntax is more flexible.
5. Use of pronouns: unlike English, Arabic does not allow stranded prepositions; resumptive pronouns are used instead to construct a full prepositional phrase.
6. Distinction due to pre/post adjectival modification: in English the adjective precedes the noun it modifies, while in Arabic the case is the reverse.
7. Distinction due to pre/post genitive construction: English uses the post-genitive construction (X's Y), while in Arabic the genitive's head precedes its complement (X Y).
8. Dropping of the modified noun: Arabic tends to drop the modified noun in the case of abstract or rational nouns. For example, "the wise" and "the clear" imply "the wise man" and "the clear object". While this elliptic construction is extremely productive in Arabic, it is highly restricted in English and is used only in limited cases such as "the rich", "the blind", or "the disabled".
9. Relativization: while English relativizes both definite and indefinite nouns, Arabic restricts relativization to definite nouns only.
10. Punctuation: punctuation in English follows strict rules; Arabic punctuation, on the other hand, is much more flexible, and its usage is rather discretionary.
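
As indicated above, a few of these divergences can be expressed as declarative reordering parameters consulted during generation; the labels and patterns below are illustrative assumptions only, not the system's actual transfer data:

    # Minimal, assumed sketch of word-order divergences as reordering parameters.
    DIVERGENCES = {
        # constituent          English order          Arabic order
        "clause":          (["S", "V", "O"],       ["V", "S", "O"]),        # SVO vs VSO
        "adjectival_np":   (["Adj", "N"],          ["N", "Adj"]),           # pre- vs post-modification
        "genitive_np":     (["Possessor", "N"],    ["N", "Possessor"]),     # X's Y vs head-first
    }

    def reorder(constituent_type: str, parts: dict, target: str) -> list[str]:
        """Reorder labelled parts according to the target language's pattern."""
        english_order, arabic_order = DIVERGENCES[constituent_type]
        order = arabic_order if target == "Arabic" else english_order
        return [parts[label] for label in order]

    parts = {"S": "the-boy", "V": "read", "O": "the-book"}
    print(reorder("clause", parts, "English"))  # ['the-boy', 'read', 'the-book']
    print(reorder("clause", parts, "Arabic"))   # ['read', 'the-boy', 'the-book']
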
In brief, the isomorphic Arabic-English translator is data-intensive, and the transfer model has been reduced to a direct mapper between the grammar pair of the source and target languages. Any lack of isomorphism due to differences in the style of grammar writing can be resolved via the explosion and implosion of syntactical categories on both sides of the language pair. This is achieved via what is known technically as recursive ascent or descent along the grammar rules. In this isomorphic model, the same grammar is used for both analysis (parsing) and synthesis (generation). The constraints attached to the different grammatical rules are interpreted differently by the parser and the generator. For instance, while verb-subject agreement is treated during parsing as a condition for the well-formedness of the input sentence, the same agreement is used during generation to set the inflectional features of the appropriate forms of the verb and subject.
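
The following sketch illustrates this dual interpretation of a single agreement specification; the feature names and data structures are assumptions introduced for illustration:

    # Hedged sketch: the same agreement specification read as a check (parsing)
    # and as a feature-setter (generation).
    AGREEMENT_FEATURES = ("number", "gender")

    def check_agreement(subject: dict, verb: dict) -> bool:
        """Parser view: reject the analysis if subject and verb features clash."""
        return all(subject[f] == verb[f] for f in AGREEMENT_FEATURES if f in verb)

    def impose_agreement(subject: dict, verb: dict) -> dict:
        """Generator view: copy the subject's features onto the verb before inflection."""
        return {**verb, **{f: subject[f] for f in AGREEMENT_FEATURES}}

    subject = {"lemma": "walad", "number": "singular", "gender": "masculine"}
    verb = {"lemma": "kataba"}

    inflected = impose_agreement(subject, verb)
    print(check_agreement(subject, inflected))   # True: the generated form passes the parser's check
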
V Some Requirements for the Globalization of Machine Translation Technology
To support the globalization of machine translation technology, the author recommends the
following:
1. The development of a universal meta-language for both grammar and lexicon.
2. The development of automatic tools for conversion between different grammar
models.
3. The standardization of lexical organization and content.
4. The development of translation-oriented multi-lingual textual corpora.
5. More importance has to be given to textual contrastive linguistics.
6. The encouragement of research work in the field of contrastive computational linguistics.
References
1. Freidin, Robert. 1992. Principles and Parameters in Comparative Grammar. Cambridge: MIT Press, pp. 1-6.
2. Aronoff, Mark. 1994. Morphology by Itself: Stems and Inflectional Classes. Cambridge: MIT Press.
3. Pustejovsky, James. 1995. The Generative Lexicon. Cambridge: MIT Press, pp. 105-131.
4. Abdel-Messih, Marie-Thérèse. 1999. "Translation as a Cross-Cultural Discourse." Proceedings of the Fifth International Symposium on Comparative Literature. Cairo: Faculty of Arts Press, pp. 283-313.
5. Dorr, Bonnie Jean. 1993. Machine Translation: A View from the Lexicon. Cambridge: MIT Press, pp. 2-22.
6. Landsbergen, Jan. 1987. "Isomorphic Grammars and Their Use in the Rosetta Translation System." In Machine Translation Today, ed. Margaret King. Edinburgh: Edinburgh University Press, pp. 351-372.