Not enough tags for fuzzy matches (OmegaT support)


Thread poster: Mark van de Velde

Mark van de Velde
Poland
Local time: 14:24
English to Dutch

Dec 14, 2017

Hello,

I'm new to OmegaT and struggling with the fuzzy matches. In some sentences in my source document almost every word is tagged; in some cases even words have been split in two.

In other cases a single tag has been assigned to an entire paragraph. For the paragraph I am not getting any suggestions, of course, although I know that parts of sentences in that paragraph match earlier translations. (An agency provided me with a translation memory.)

I checked...

Susan Welsh

United States
Local time: 08:24
Russian to English
+ ...

get rid of the junk tags

Dec 14, 2017

These garbage tags come from MS Word .docx format, as I understand it. Use Codezapper or TransTools to clean up the Word file, and then you will probably get matches.

Samuel Murray

Netherlands
Local time: 14:24
Member (2006)
English to Afrikaans
+ ...

I agree, try it

Dec 14, 2017

Susan Welsh wrote:
Use Codezapper ... to clean up the Word file, and then you will probably get matches.

Use the CZL macro in CodeZapper -- it often works wonders. It costs EUR 20. TransTools appears to be free.

Mark van de Velde
Poland
Local time: 14:24
English to Dutch

TOPIC STARTER

rtf or txt?

Dec 14, 2017

Thanks, I will try Transtools first. By the way, will I get better tags if I copy-paste the text from the .docx file and save it to another file format, such as .rtf or .txt?

Didier Briel

France
Local time: 14:24
English to French
+ ...

Use .odt or .txt

Dec 15, 2017

markvdvelde wrote:
Thanks, I will try Transtools first. By the way, will I get better tags if I copy-paste the text from the .docx file and save it to another file format, such as .rtf or .txt?

.rtf is not supported.
You will generally have far fewer tags in .odt (LibreOffice).
You will have no tags at all in .txt.

Didier

Mark van de Velde
Poland
Local time: 14:24
English to Dutch

TOPIC STARTER

Now I am having too few tags

Dec 15, 2017

You will generally have far fewer tags in .odt (LibreOffice).

I indeed have far fewer tags in odt. But to get fuzzy matches I do need tags as such, right? So how do I get meaningful tags that break up sentences in the right places?

Now entire sentences or paragraphs are tagged with a single tag. For a long sentence I am not getting any suggestions, although I know that parts of sentences in that paragraph match earlier translations. (An agency provided me with a translation memory.) So now I have got the opposite problem...

Samuel Murray

Netherlands
Local time: 14:24
Member (2006)
English to Afrikaans
+ ...

@Mark

Dec 15, 2017

markvdvelde wrote:
Will I get better tags if I copy-paste the text from the .docx file and save it to another file format, such as ... .txt?

No need for copy/paste -- simply use MS Word itself. Press F12 for the "Save as" dialog, and then select "TXT" in the "Save as type" dropdown list. MS Word will likely prompt you with an encoding dialog: select "Other encoding" and then select either "Unicode" (which is Microsoft's way of saying "UTF-16-LE") or "Unicode (UTF-8)" from the dropdown list. Then test your TXT file in your favourite text editor to see if it looks okay.
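
If you would rather check the encoding with a script than by eye, something like the following might do. It is only a rough sketch: the function name is made up for illustration, and it assumes the exported file either starts with a byte order mark or is plain UTF-8 without one.

```python
import codecs
from pathlib import Path


def detect_word_txt_encoding(path):
    """Guess whether a TXT file is UTF-8 or UTF-16 (hypothetical helper)."""
    raw = Path(path).read_bytes()
    # A byte order mark, if present, identifies the encoding directly.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # No BOM: accept the file as UTF-8 only if it decodes without errors.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Probably a legacy code page; re-export from Word with a Unicode option.
        return None
```

The returned name can be passed to Python's `open(path, encoding=...)`. A result of `None` suggests the file was saved in a legacy encoding and should be exported again.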

Susan Welsh

United States
Local time: 08:24
Russian to English
+ ...

Too few tags??

Dec 15, 2017

markvdvelde wrote:

So how do I get meaningful tags that break up sentences in the right places?
Now entire sentences or paragraphs are tagged with a single tag.

Mark, you seem confused about what tags are and what they're for. They are mainly for things like font changes within a line, and have nothing to do with breaking up paragraphs into segments (I assume that's what you mean by "breaking up sentences in the right places," since you would almost never want to break up a grammatical sentence into more than one segment). It is by no means uncommon to have entire sentences with a single tag, or to have no tags at all in a heavily formatted document like a table.

For example, in a document I am now working on, I have this:
E-mail: [t0/][email protected] (I am not using the proper symbol for the tag, because otherwise it screws up this message in html.)

The [t0/] tag has to do with the fact that the email address is a hyperlink.

In the fuzzy match pane, I get this: E-mail: [email protected] - this one has no tag (it's not a hyperlink), so there is only a match of [50/50/57%] - which means basically that the only thing the two segments have in common is the word "E-mail:" and the @ sign.

There are no more tags anywhere in my document. The font changes do not occur in the middle of segments, so they are taken care of by OmegaT magic where you don't see them. For example, the formatting for subheads has no tags.

You seem to have confused tagging and segmentation. I suggest you look back at the manual or the beginner's CAT tutorial on the OmegaT website.

For one thing, you have to decide whether you want paragraph or sentence segmentation (Project > Properties > check or uncheck "Enable sentence-level segmenting"). Your statement that "Now entire sentences or paragraphs are tagged with a single tag" makes no sense. If you are using paragraph segmentation, you will not get matches unless a PARAGRAPH in your TM is very similar to the paragraph in your new document.

[Edited at 2017-12-15 22:30 GMT]

[Edited at 2017-12-15 22:32 GMT]

Mark van de Velde
Poland
Local time: 14:24
English to Dutch

TOPIC STARTER

I meant segments, not tags

Dec 16, 2017

You seem to have confused tagging and segmentation. I suggest you look back at the manual or the beginner's CAT tutorial on the OmegaT website.

Hi Susan, you're absolutely right. I'm sorry for the confusion. I should have been talking about segments.

Yet the problem remains. If I am correct, segments are whole sentences (unless you uncheck "Enable sentence-level segmenting").

Let me give you an example. Let's say that the translation memory contains this sentence:

The predefined break rules should be sufficient for most European languages and Japanese.

If a document to be translated contains this sentence...

The predefined break rules should be sufficient for most European languages and Chinese.

...I won't get a fuzzy match.

I'd say that phrases like "predefined break rules" should be considered a segment, but how can I force OmegaT to do this?

Didier Briel

France
Local time: 14:24
English to French
+ ...

You should have a match

Dec 17, 2017

markvdvelde wrote:
Yet the problem remains. If I am correct, segments are whole sentences (unless you uncheck "Enable sentence-level segmenting").

Let me give you an example. Let's say that the translation memory contains this sentence:

The predefined break rules should be sufficient for most European languages and Japanese.

If a document to be translated contains this sentence...

The predefined break rules should be sufficient for most European languages and Chinese.

...I won't get a fuzzy match.

Then there's something wrong (for instance, the translation memory is not actually loaded).
With your example, I get a match with 88/92/96% in OmegaT.
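
To see why two sentences that differ in one word should score high, here is a rough sketch of a word-level similarity measure. It is not OmegaT's actual scoring (the three percentages OmegaT shows come from its own calculation); it is a plain Levenshtein distance over word tokens, and the function names are made up for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(source, candidate):
    """Share of unchanged tokens, as a percentage of the longer segment."""
    a, b = tokenize(source), tokenize(candidate)
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


tm_segment = ("The predefined break rules should be sufficient "
              "for most European languages and Japanese.")
new_segment = ("The predefined break rules should be sufficient "
               "for most European languages and Chinese.")
print(round(similarity(tm_segment, new_segment)))  # prints 92
```

One substituted word out of thirteen leaves about 92% of the segment intact, well within what any fuzzy matcher would report. If nothing shows up for a pair like this, the cause is unlikely to be the sentences themselves.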

I'd say that phrases like "predefined break rules" should be considered a segment, but how can I force OmegaT to do this?

You don't, because it wouldn't make sense. That would mean you would have to translate
The

predefined break rules

should be sufficient for most European languages and Chinese.

as 3 separate segments/sentences.

I would recommend focusing on understanding how it should work (have you done the Instant Start tutorial?), and then checking why you don't get matches when you think you should get them.

Didier
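
One way to check whether the agency's translation memory actually contains what you expect is to list its source segments outside OmegaT. The sketch below assumes the memory is a TMX file with the usual `tu`/`tuv`/`seg` elements; the function name and the sample content are invented for illustration, and the Dutch side is only a placeholder.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree reports it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def source_segments(tmx_text, source_lang):
    """Return the text of every <seg> whose <tuv> language starts with source_lang."""
    root = ET.fromstring(tmx_text)
    segments = []
    for tuv in root.iter("tuv"):
        # Newer TMX files use xml:lang; older ones may use a plain lang attribute.
        lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        seg = tuv.find("seg")
        if seg is not None and lang.lower().startswith(source_lang.lower()):
            # itertext() also collects text around inline tag elements.
            segments.append("".join(seg.itertext()).strip())
    return segments


# Made-up sample; a real memory would be read from disk instead.
sample_tmx = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The predefined break rules should be sufficient for most European languages and Japanese.</seg></tuv>
      <tuv xml:lang="nl"><seg>(Dutch translation goes here)</seg></tuv>
    </tu>
  </body>
</tmx>"""

segments = source_segments(sample_tmx, "en")
print(len(segments), "source segment(s) found")
```

If the list is empty, or the language codes in the file differ from the project's source language, that would explain missing matches before segmentation even comes into play.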

Samuel Murray

Netherlands
Local time: 14:24
Member (2006)
English to Afrikaans
+ ...

@Mark, rereading your initial post

Dec 17, 2017

markvdvelde wrote:
I'm new to OmegaT and struggling... In some sentences in my source document almost every word is [in a separate segment]; in some cases even words have been split in two. In other cases a single [segment] [is] an entire paragraph.

It would seem that the fault lies with your source document's hidden formatting.

It would seem that there are hidden codes in your document that cause OmegaT to split up the text in the way you describe. I have seen this before, in scanned or PDF converted documents.

In some PDF files, each word is a separate "chunk", and when converted to something like MS Word, each word becomes an individual text box, which is usually reserved for whole chunks of text. This is because the converter program uses the text box feature for positioning each word individually instead of positioning whole paragraphs. Similarly, some OCR functions in scanners put individual words in separate text boxes (and "text box" may not even be the right word... there is something in MS Word called "frames" which is also like a text box but behaves in a different way...).

So the document *looks* fine when you view it in the relevant program, but it is actually not an editable file in the usual sense.

Could this explain the problem you're having? If you have an OCR program yourself, then print the file, scan it, and OCR it again (this time using a setting that favours editing over positioning).

Samuel Murray

Netherlands
Local time: 14:24
Member (2006)
English to Afrikaans
+ ...

@Mark, re "fuzzy matches" and segmentation

Dec 17, 2017

markvdvelde wrote:
...I won't get a fuzzy match.

The term "fuzzy match" means "similar but not exactly the same". In other words, "The cat sat on the mat" is a fuzzy match for "The dog sat on mat", because it is similar but not exactly the same. One could say the opposite of a fuzzy match is an exact match.

I'd say that phrases like "predefined break rules" should be considered a segment, but how can I force OmegaT to do this?

Humans define segments as content. OmegaT defines segments as boundaries. In other words, in "[some text here]...is this a dog? This is a cat. This is my cat...[some text here]", OmegaT doesn't know that "This is a cat." is a sentence. OmegaT only knows that "? " and ". " mark the start and end of that particular sentence.

In other words, OmegaT splits up text into segments based on elements in that text which OmegaT recognises as "segment breaks". If OmegaT sees something that is a segment break, it breaks the text at that point, thereby creating two segments (or one segment, if there is no text before or after the segment break).

If you want OmegaT to put "predefined break rules" in a separate segment, you have to figure out a way to tell OmegaT how to break up the text so that "predefined break rules" ends up in a separate segment. And that is not really possible (or feasible).
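
The idea of segmenting by boundaries can be sketched in a few lines. This is a simplified illustration, not OmegaT's rule set: the single break rule here (sentence-ending punctuation followed by whitespace) and the function name are assumptions, and real rules also handle exceptions such as abbreviations.

```python
import re

# Hypothetical break rule: sentence-ending punctuation followed by whitespace.
BREAK = re.compile(r"(?<=[.?!])\s+")


def segment(paragraph, sentence_level=True):
    """Split a paragraph into segments at break points, or keep it whole."""
    if not sentence_level:
        # Paragraph segmentation: the whole paragraph is one segment.
        return [paragraph.strip()]
    return [s for s in BREAK.split(paragraph.strip()) if s]


text = "Is this a dog? This is a cat. This is my cat."
print(segment(text))
print(segment(text, sentence_level=False))
print(segment("The predefined break rules should be sufficient."))
```

The first call yields three segments, the second yields one, and the third shows the point made above: nothing around "predefined break rules" looks like a boundary, so the phrase stays inside its sentence.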

[Edited at 2017-12-17 11:23 GMT]

Susan Welsh

United States
Local time: 08:24
Russian to English
+ ...

still confused

Dec 17, 2017

I think you are still confused. If you want to send me a bit of your document and the client's TM, I will take a look. It would be best for you to follow Didier's advice first (follow the "Learn to use OmegaT in 5 minutes!" tutorial that is on the start page when you open OmegaT), but I have time to look at your file today and won't during the coming week.

Susan
