Not enough tags for fuzzy matches Thread poster: Mark van de Velde
Hello, I'm new to OmegaT and struggling with the fuzzy matches. In some sentences in my source document almost every word is tagged; in some cases even words have been split in two. In other cases a single tag has been assigned to an entire paragraph. For the paragraph I am not getting any suggestions, of course, although I know that parts of sentences in that paragraph match earlier translations. (An agency provided me with a translation memory.) I checked the source and target language, and they're OK. What should I do to get meaningful fuzzy matches?

Thanks, Mark

| | | Susan Welsh United States Local time: 08:24 Russian to English + ... get rid of the junk tags | Dec 14, 2017 |
These garbage tags come from MS Word .docx format, as I understand it. Use Codezapper or TransTools to clean up the Word file, and then you will probably get matches. | | | Samuel Murray Netherlands Local time: 14:24 Member (2006) English to Afrikaans + ... I agree, try it | Dec 14, 2017 |
Susan Welsh wrote:
Use Codezapper ... to clean up the Word file, and then you will probably get matches.

Use the CZL macro in CodeZapper -- it often works wonders. It costs EUR 20. TransTools appears to be free. | | |
Thanks, I will try Transtools first. By the way, will I get better tags if I copy-paste the text from the .docx file and save it to another file format, such as .rtf or .txt? | |
Didier Briel France Local time: 14:24 English to French + ... Use .odt or .txt | Dec 15, 2017 |
markvdvelde wrote:
Thanks, I will try Transtools first. By the way, will I get better tags if I copy-paste the text from the .docx file and save it to another file format, such as .rtf or .txt?

.rtf is not supported. You will generally have much less tags in .odt (LibreOffice). You will have no tags at all in .txt.

Didier

| | | Now I am having too few tags | Dec 15, 2017 |
Didier Briel wrote:
You will generally have much less tags in .odt (LibreOffice).

I indeed have far fewer tags in odt. But to get fuzzy matches I do need tags as such, right? So how do I get meaningful tags that break up sentences in the right places? Now entire sentences or paragraphs are tagged with a single tag. For a long sentence I am not getting any suggestions, although I know that parts of sentences in that paragraph match earlier translations. (An agency provided me with a translation memory.) So now I have got the opposite problem...

| | | Samuel Murray Netherlands Local time: 14:24 Member (2006) English to Afrikaans + ...
markvdvelde wrote:
Will I get better tags if I copy-paste the text from the .docx file and save it to another file format, such as ... .txt?

No need for copy/paste -- simply use MS Word itself. Press F12 for the "Save as" dialog, and then select "TXT" in the "Save as type" dropdown list. MS Word will likely prompt you with an encoding dialog: select "Other encoding" and then select either "Unicode" (which is Microsoft's way of saying "UTF-16-LE") or "Unicode (UTF-8)" from the dropdown list. Then test your TXT file in your favourite text editor to see if it looks okay.

| | | Susan Welsh United States Local time: 08:24 Russian to English + ... Too few tags?? | Dec 15, 2017 |
markvdvelde wrote:
So how do I get meaningful tags that break up sentences in the right places? Now entire sentences or paragraphs are tagged with a single tag.

Mark, you seem confused about what tags are and what they're for. They are mainly for things like font changes within a line, and have nothing to do with breaking up paragraphs into segments (I assume that's what you mean by "breaking up sentences in the right places," since you would almost never want to break up a grammatical sentence into more than one segment). It is by no means uncommon to have entire sentences with a single tag, or to have no tags at all in a heavily formatted document like a table.

For example, in a document I am now working on, I have this:

E-mail: [t0/][email protected]

(I am not using the proper symbol for the tag, because otherwise it screws up this message in html.) The [t0/] tag has to do with the fact that the email address is a hyperlink. In the fuzzy match pane, I get this:

E-mail: [email protected]

This one has no tag (it's not a hyperlink), so there is only a match of [50/50/57%], which means basically that the only thing the two segments have in common is the word "E-mail:" and the @ sign.

There are no more tags anywhere in my document. The font changes do not occur in the middle of segments, so they are taken care of by OmegaT magic where you don't see them. For example, the formatting for subheads has no tags.

You seem to have confused tagging and segmentation. I suggest you look back at the manual or the beginner's CAT tutorial on the OmegaT website. For one thing, you have to decide whether you want paragraph or sentence segmentation (Project > Properties > check or uncheck "Enable sentence-level segmenting"). Your sentence that "Now entire sentences or paragraphs are tagged with a single tag" makes no sense. If you are using paragraph segmentation, you will not get matches unless a PARAGRAPH in your TM is very similar to the paragraph in your new document.
[Edited at 2017-12-15 22:30 GMT]
[Edited at 2017-12-15 22:32 GMT] | |
I meant segments, not tags | Dec 16, 2017 |
Susan Welsh wrote:
You seem to have confused tagging and segmentation. I suggest you look back at the manual or the beginner's CAT tutorial on the OmegaT website.

Hi Susan, you're absolutely right. I'm sorry for the confusion. I should have been talking about segments. Yet the problem remains. If I am correct, segments are whole sentences (unless you uncheck "Enable sentence-level segmenting"). Let me give you an example. Let's say that the translation memory contains this sentence:

The predefined break rules should be sufficient for most European languages and Japanese.

If a document to be translated contains this sentence...

The predefined break rules should be sufficient for most European languages and Chinese.

...I won't get a fuzzy match. I'd say that phrases like "predefined break rules" should be considered a segment, but how can I force OmegaT to do this?

| | | Didier Briel France Local time: 14:24 English to French + ... You should have a match | Dec 17, 2017 |
markvdvelde wrote:
Yet the problem remains. If I am correct, segments are whole sentences (unless you uncheck "Enable sentence-level segmenting"). Let me give you an example. Let's say that the translation memory contains this sentence: The predefined break rules should be sufficient for most European languages and Japanese. If a document to be translated contains this sentence... The predefined break rules should be sufficient for most European languages and Chinese. ...I won't get a fuzzy match.

Then there's something wrong (for instance, the translation memory is not actually loaded). With your example, I get a match with 88/92/96% in OmegaT.

markvdvelde wrote:
I'd say that phrases like "predefined break rules" should be considered a segment, but how can I force OmegaT to do this?

You don't, because it wouldn't make sense. That would mean you would have to translate

The predefined break rules should be sufficient for most European languages and Chinese.

as 3 separate segments/sentences.

I would recommend focusing on understanding how it should work (have you done the Instant Start tutorial?), and then checking why you don't get matches when you think you should get them.

Didier

| | | Samuel Murray Netherlands Local time: 14:24 Member (2006) English to Afrikaans + ... @Mark, rereading your initial post | Dec 17, 2017 |
markvdvelde wrote:
I'm new to OmegaT and struggling... In some sentences in my source document almost every word is [in a separate segment]; in some cases even words have been split in two. In other cases a single [segment] [is] an entire paragraph.

It would seem that the fault lies with your source document's hidden formatting. It would seem that there are hidden codes in your document that cause OmegaT to split up the text in the way you describe.

I have seen this before, in scanned or PDF converted documents. In some PDF files, each word is a separate "chunk", and when converted to something like MS Word, each word becomes an individual text box, which is usually reserved for whole chunks of text. This is because the converter program uses the text box feature for positioning each word individually instead of positioning whole paragraphs. Similarly, some OCR functions in scanners put individual words in separate text boxes (and "text box" may not even be the right word... there is something in MS Word called "frames" which is also like a text box but behaves in a different way...). So the document *looks* fine when you view it in the relevant program, but it is actually not an editable file in the usual sense.

Could this explain the problem you're having? If you have an OCR program yourself, then print the file, scan it, and OCR it again (this time using a setting that favours editing over positioning).

| | | Samuel Murray Netherlands Local time: 14:24 Member (2006) English to Afrikaans + ... @Mark, re "fuzzy matches" and segmentation | Dec 17, 2017 |
markvdvelde wrote:
...I won't get a fuzzy match.

The term "fuzzy match" means "similar but not exactly the same". In other words, "The cat sat on the mat" is a fuzzy match for "The dog sat on mat", because it is similar but not exactly the same. One could say the opposite of a fuzzy match is an exact match.

markvdvelde wrote:
I'd say that phrases like "predefined break rules" should be considered a segment, but how can I force OmegaT to do this?

Humans define segments as content. OmegaT defines segments as boundaries. In other words, in "[some text here]...is this a dog? This is a cat. This is my cat...[some text here]", OmegaT doesn't know that "This is a cat." is a sentence. OmegaT only knows that "? " and ". " are the start and end of that particular sentence.

In other words, OmegaT splits up text into segments based on elements in that text which OmegaT recognises as "segment breaks". If OmegaT sees something that is a segment break, it breaks the text at that point, thereby creating two segments (or one segment, if there is no text before or after the segment break).

If you want OmegaT to put "predefined break rules" in a separate segment, you have to figure out a way to tell OmegaT how to break up the text so that "predefined break rules" ends up in a separate segment. And that is not really possible (or feasible).
[Edited at 2017-12-17 11:23 GMT] | |
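The two ideas in Samuel's post above (segments are cut at boundaries, and a fuzzy match is "similar but not the same") can be illustrated with a minimal Python sketch. It is not OmegaT's code: the break rule is a hypothetical stand-in for OmegaT's segmentation rules, and `word_similarity` is a rough word-level measure built on Python's standard `difflib`, not OmegaT's own scoring. The names `BREAK_RULE`, `split_segments` and `word_similarity` are made up for this illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical break rule (an assumption, not OmegaT's actual rule set):
# break at whitespace that directly follows ".", "?" or "!".
BREAK_RULE = re.compile(r"(?<=[.?!])\s+")


def split_segments(text):
    """Cut the text wherever the break rule matches; the content is never inspected."""
    return [part for part in BREAK_RULE.split(text.strip()) if part]


def word_similarity(a, b):
    """Rough word-level similarity in percent (illustrative, not OmegaT's scoring)."""
    return round(100 * SequenceMatcher(None, a.split(), b.split()).ratio())


segments = split_segments("Is this a dog? This is a cat. This is my cat.")

tm_sentence = ("The predefined break rules should be sufficient "
               "for most European languages and Japanese.")
new_sentence = ("The predefined break rules should be sufficient "
                "for most European languages and Chinese.")
score = word_similarity(tm_sentence, new_sentence)
```

With Mark's two example sentences, 12 of the 13 words line up, so this sketch gives a score of 92 even though the segments are not identical. Nothing in `split_segments` could isolate "predefined break rules", because no boundary pattern surrounds that phrase.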
Susan Welsh United States Local time: 08:24 Russian to English + ... still confused | Dec 17, 2017 |
I think you are still confused. If you want to send me a bit of your document and the client's TM, I will take a look. It would be best for you to follow Didier's advice first (follow the "Learn to use OmegaT in 5 minutes!" tutorial that is on the start page when you open OmegaT), but I have time to look at your file today and won't during the coming week.

Susan