Polish Phoneme Statistics Obtained on Large Set of Written Texts

proceeding t\t j. This basic scheme is extended to cover overlapping phonetic contexts. If more than one result is possible, the longer context is chosen for transcription, which increases its accuracy. Exceptions are handled by additional tables in a similar manner.
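The preference for the longer context can be illustrated with a minimal sketch. The rule table below is a toy, hypothetical example and not the authors' rule set: the output symbols are only SAMPA-like, the context is simplified to the letters being consumed, and `RULES`, `MAX_CONTEXT` and `transcribe` are names introduced here for illustration.

```python
# Toy rule table (hypothetical, NOT the rules used in the paper):
# letter context -> list of SAMPA-like phoneme symbols.
RULES = {
    "sz": ["S"],
    "cz": ["tS"],
    "rz": ["Z"],
    "ch": ["x"],
    "s": ["s"],
    "z": ["z"],
    "c": ["ts"],
    "r": ["r"],
    "h": ["x"],
    "a": ["a"],
    "e": ["e"],
    "o": ["o"],
    "y": ["I"],
    "k": ["k"],
    "m": ["m"],
    "w": ["v"],
}

# Longest context present in the table; longer matches are tried first.
MAX_CONTEXT = max(len(context) for context in RULES)


def transcribe(word):
    """Transcribe a word, always preferring the longest matching context."""
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then fall back to shorter ones.
        for length in range(min(MAX_CONTEXT, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += length
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r} in {word!r}")
    return phonemes


print(transcribe("szary"))
print(transcribe("morze"))
```

In this sketch "sz" and "rz" win over the single-letter rules for "s", "r" and "z" because the two-letter context is tried first. An exception table could be consulted before `RULES` in the same way.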

Specific transcription rules were designed by a human expert in an iterative process of testing and updating the rules. The text corpora used in the design process consisted of various sample texts (newspaper articles) and a few thousand words and phrases, including special cases and exceptions.

3.2. Corpora used

Several newspaper articles in Polish were used as input data in our experiment. They come from the Rzeczpospolita newspaper from the years 1993-2002. They cover mainly political and economic issues, so they contain quite a few names and places, including foreign ones, which may influence the results slightly. For example, q appeared once, even though it does not exist in Polish. In total, 879 megabytes of text, corresponding to around 110 000 000 words, were included in the process.

Several hundred thousand Internet articles in Polish made up another corpus. They all come from a high-quality website where all content is reviewed and controlled by moderators. They are encyclopedic in type, so they also contain many names, including foreign ones. In total, 754 megabytes (around 94 000 000 words) were included in the process.

The third corpus consists of several literary books in Polish. Some of them are translations from other languages, so they also contain foreign words. The corpus includes 490 megabytes (around 61 000 000 words) of text.

4. Results

In total, around 1 856 900 000 phonemes were analysed. They are grouped into 40 categories (including space). In fact, one more, namely q, was detected; it appeared in a foreign name. Since q is not part of the Polish alphabet, it was not included in the phoneme distribution presented in Table 1. The frequency of space (denoted #) was 15.26%. The average number of phonemes per word is 6.6, including one space. Exactly 1271 different diphones (Fig. 1 and Table 2) were found out of 1560 possible combinations, which constitutes 81%.
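The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below assumes that every word contributes exactly one space, so the average word length (including that space) is the reciprocal of the space frequency. The variable names are introduced here for illustration. The value 1560 is taken from the text as reported; it equals 40 × 39, but the text does not state which pairs are excluded.

```python
# Values reported in the text.
space_frequency = 0.1526      # share of '#' among all phonemes
found_diphones = 1271
possible_diphones = 1560      # as reported; equals 40 * 39

# Assumption: one space per word, so one symbol in every
# 1 / space_frequency symbols is a word boundary.
avg_phonemes_per_word = 1 / space_frequency

diphone_coverage = found_diphones / possible_diphones

print(f"average phonemes per word (incl. space): {avg_phonemes_per_word:.1f}")
print(f"diphone coverage: {diphone_coverage:.0%}")
```

Both results agree with the rounded values quoted above (6.6 phonemes per word and 81%).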

In total, 21 961 different triphones (see Table 3) were detected. Combinations of the form *#*, where * is any phoneme and # is a space, were removed. These triples should not be considered triphones because the first and the second * belong to two different words. The list of the most common triphones is presented in Table 3. Assuming 40 different phonemes (including space) and subtracting the *#* combinations mentioned above, there are 62 479 possible triples. The 21 961 triphones found therefore account for around 35% of the possible triples, and the vast majority of them occurred at least 10 times.
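A minimal sketch of this counting scheme follows. It is not the authors' implementation. It assumes that * stands for any non-space phoneme, a reading that reproduces the reported total: 40³ − 39² = 64 000 − 1 521 = 62 479. The names `count_triphones` and `possible_triples`, and the toy phoneme sequence, are hypothetical.

```python
from collections import Counter

SPACE = "#"


def count_triphones(phonemes):
    """Count triples of consecutive symbols, skipping *#* combinations."""
    counts = Counter()
    for a, b, c in zip(phonemes, phonemes[1:], phonemes[2:]):
        # In a *#* triple the outer phonemes belong to two different words.
        if b == SPACE and a != SPACE and c != SPACE:
            continue
        counts[(a, b, c)] += 1
    return counts


def possible_triples(n_symbols=40):
    """All triples over n_symbols, minus *#* with * a non-space phoneme."""
    non_space = n_symbols - 1
    return n_symbols ** 3 - non_space ** 2


# Toy input: two short words separated by a space symbol.
sample = ["a", "l", "a", SPACE, "m", "a"]
print(count_triphones(sample))

# Coverage reported in the text: 21 961 found out of 62 479 possible.
print(possible_triples(), f"{21961 / possible_triples():.0%}")
```

In the toy input the triple (a, #, m) is dropped, while (l, a, #) and (#, m, a) are kept because they carry word-final and word-initial information.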

Computer Science • Vol. 10 • 2009. Bartosz Ziółko, Jakub Gałka, Mariusz Ziółko. Polish Phoneme Statistics Obtained on Large Set of Written Texts.