(Part of) the IATE database can now be downloaded as a massive TBX! Thread poster: Michael Beijer
| Michael Beijer United Kingdom Local time: 03:46 Member (2009) Dutch to English + ...
‘Download IATE
IATE is a living database, i.e. translators and terminologists are continuously updating its content. In 2013, almost 97 000 new terms were added and 158 000 existing terms were modified. These changes were also reviewed and validated. Using the IATE search interface (http://iate.europa.eu/) thus ensures that you are accessing the most complete and up-to-date data.
However, in order to cater for specific needs, you can also download a copy of some of the data contained in IATE. The download file contains about 8 million terms in 24 official EU languages. It is provided in TermBase eXchange (TBX) format. For further details see: TBXcoreStructV02.dtd, TBXXCS.xcs, tbxxcsdtd.dtd. The size of the uncompressed file is about 2.2 gigabytes. For information on the data structure and the data categories included in the download file, please see: IATE Data fields explained
You can download the file by clicking on the link below. IATE_download_25062014.zip (Publication date: 25/06/2014)
Statistics: The download file contains 1.3 million concepts.’
--------
A quick look at nl-en, and I count over 450,000 entries! And all of them reviewed and validated! Sadly, the definitions are not present, but that is understandable as the multilingual TBX is already 2GB!
-----> http://iate.europa.eu/tbxPageDownload.do
Another massive one | Jul 11, 2014 |
Dear Michael, thanks for pointing me to that massive piece of terminology, but it seems indigestible for my PC. I have 6 GB of free space on my C drive and 25 GB on my D drive, but when I try to import the TBX file I immediately get an error message that there is not enough free space left. MQ does try to load the file in the background, but after one hour of no response I have to cancel the process. Is there a way to pre-extract only the language pairs I am really interested in, to lighten the job for MemoQ? Have a nice weekend, Noe
PS: I have now put the TBX file on my USB stick, and it seems that some sort of import is going on in the background.
[Edited at 2014-07-11 16:12 GMT]
Erik Freitag Germany Local time: 04:46 Member (2006) Dutch to German + ... No luck with MultiTerm either | Jul 11, 2014 |
Dear Michael, Thanks for the link! I have tried to import the TBX into MultiTerm, but the process is aborted within a couple of seconds with the message "System.OutofMemoryException". I'm using a Win7 PC, 64-bit, with 16 GB of RAM. Seeing that MemoQ fails here as well, maybe something's wrong with the TBX? Has anyone been able to successfully import the TBX? Kind regards, Erik
Thanks, Michael. I will look into converting it to some more palatable format / extracting language pairs. It shouldn't be hard to generate a basic tabbed glossary, but there is quite a lot of metadata in there and some of it may be worth conserving. Doing that is trickier, of course. The download fails for me every time... it's supposed to be a 113 MB file, but the download stops early and I'm stuck with a partial file. Could someone rehost it on Dropbox or something, please?
[Edited at 2014-07-11 09:32 GMT]
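For the language-pair extraction asked about above, one option that avoids holding the whole 2 GB file in memory is to stream it entry by entry. What follows is only a minimal Python sketch, not a tested converter for the real download: it assumes the export nests termEntry / langSet (with an xml:lang attribute) / term in the usual core TBX way, and the names iter_pairs, to_tab_delimited and SAMPLE are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pairs(source, src_lang, tgt_lang):
    """Stream a TBX file and yield (source_term, target_term) tuples.

    Assumption (not confirmed in this thread): terms live in
    <termEntry>/<langSet xml:lang="..">/.../<term>.
    """
    for _event, entry in ET.iterparse(source, events=("end",)):
        if entry.tag != "termEntry":
            continue
        terms = {src_lang: [], tgt_lang: []}
        for lang_set in entry.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            if lang in terms:
                for term in lang_set.iter("term"):
                    text = "".join(term.itertext()).strip()
                    if text:
                        terms[lang].append(text)
        # Entries that lack either language yield nothing.
        for src in terms[src_lang]:
            for tgt in terms[tgt_lang]:
                yield src, tgt
        # Drop the processed entry's children; only an empty shell stays behind.
        entry.clear()


def to_tab_delimited(pairs):
    """Render pairs as tab-delimited lines (write them out as UTF-8)."""
    return "".join(f"{src}\t{tgt}\n" for src, tgt in pairs)


# Tiny made-up sample standing in for the real export.
SAMPLE = b"""<martif type="TBX"><text><body>
<termEntry id="demo-1">
  <langSet xml:lang="en"><tig><term>aid</term></tig></langSet>
  <langSet xml:lang="nl"><tig><term>steun</term></tig></langSet>
  <langSet xml:lang="de"><tig><term>Beihilfe</term></tig></langSet>
</termEntry>
<termEntry id="demo-2">
  <langSet xml:lang="en"><tig><term>levy</term></tig></langSet>
</termEntry>
</body></text></martif>"""

print(to_tab_delimited(iter_pairs(io.BytesIO(SAMPLE), "nl", "en")), end="")
```

On the real file you would pass the path of the unzipped TBX instead of the in-memory sample. Synonyms are expanded here into one line per source/target combination, and all other metadata is dropped.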
Michael Beijer United Kingdom Local time: 03:46 Member (2009) Dutch to English + ... TOPIC STARTER Xbench is working fine here… | Jul 11, 2014 |
Hi everyone,
I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM.
Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first):
‘Project > Properties’ (or F2), then click ‘Add’, then select TBX/MARTIFF Glossary, then ‘Next’, then ‘Add File’, select the file, then ‘Next’, and ‘Next’ again ...
The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear.
Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either:
– a TMX
– a tab-delimited UTF-8 text file, or
– an Excel file
I am converting a number of languages for colleagues over on the CafeTranslators mailing list, but it might be better if Adras did it as he has much more experience with data from such large multilingual projects.
Michael
See also: https://groups.google.com/forum/?fromgroups=#!topic/cafetranslators/lAopgfpC1Sw
PS: here is the original TBX I downloaded from the site (zipped and unzipped):
https://www.dropbox.com/s/ck67kppuis7e050/IATE_download_25062014.zip (113.38MB)
https://www.dropbox.com/s/zv5aavl0baq316h/IATE_export_25062014.tbx (2,117.09MB)
As tab-delimited text (created with ‘Include segments even if the source or target is missing’ OFF):
en-de: https://www.dropbox.com/s/vyvx9lnmkiboemf/en-de.txt (529,778 entries)
de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933 entries)
src: Download IATE, European Union, 2014.
[Edited at 2014-07-11 13:04 GMT]
I'm not sure how much of the data and the data structure Xbench conserves. The description says there are synonyms in there (i.e. two English terms in the same entry). I expect that Xbench will discard all but the first. Ditto for subject domains, reliability indexes etc. It's a viable option, but it shouldn't be too hard to do it better. I've had a look inside and there are quite a few acronyms (labeled as 'abbreviations') + full expressions entered as synonyms. It's worth a bit of extra work to conserve those.
[Edited at 2014-07-11 12:33 GMT]
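To make the point about synonyms and metadata concrete, here is a rough Python sketch of a conversion that keeps every term of an entry together with its type, plus the entry's subject domain. Everything about the markup is an assumption on my part rather than something confirmed in this thread: that each term sits in a tig element, that the term type is a termNote with type="termType" (with values such as "abbreviation"), that reliability is a descrip with type="reliabilityCode", and that the domain is a descrip with type="subjectField". The function names and the sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def typed_child(parent, tag, type_name):
    """Text of the first <tag type="type_name"> below parent, or ''."""
    for node in parent.iter(tag):
        if node.get("type") == type_name:
            return (node.text or "").strip()
    return ""


def terms_for(entry, lang):
    """Collect every term of one language with its type and reliability."""
    found = []
    for lang_set in entry.iter("langSet"):
        if lang_set.get(XML_LANG) != lang:
            continue
        for tig in lang_set.iter("tig"):
            term = tig.find("term")
            if term is None:
                continue
            found.append({
                "term": "".join(term.itertext()).strip(),
                "type": typed_child(tig, "termNote", "termType"),  # assumed name
                "reliability": typed_child(tig, "descrip", "reliabilityCode"),  # assumed name
            })
    return found


def glossary_rows(source, src_lang, tgt_lang):
    """Yield one dict per entry that has terms in both languages."""
    for _event, entry in ET.iterparse(source, events=("end",)):
        if entry.tag != "termEntry":
            continue
        src, tgt = terms_for(entry, src_lang), terms_for(entry, tgt_lang)
        if src and tgt:
            yield {
                "id": entry.get("id", ""),
                "subject": typed_child(entry, "descrip", "subjectField"),  # assumed name
                "source": src,
                "target": tgt,
            }
        entry.clear()  # free the entry's children once it has been handled


def format_row(row):
    """One tab-delimited line; synonyms stay in one cell, joined with ' | '."""
    def cell(terms):
        return " | ".join(f"{t['term']} [{t['type']}]" for t in terms)
    return "\t".join([row["id"], row["subject"], cell(row["source"]), cell(row["target"])])


# Made-up sample entry with a full form and an abbreviation as synonyms.
SAMPLE = b"""<martif type="TBX"><text><body>
<termEntry id="demo-1">
  <descripGrp><descrip type="subjectField">demo-domain</descrip></descripGrp>
  <langSet xml:lang="en">
    <tig><term>European Union</term><termNote type="termType">fullForm</termNote>
         <descrip type="reliabilityCode">3</descrip></tig>
    <tig><term>EU</term><termNote type="termType">abbreviation</termNote>
         <descrip type="reliabilityCode">3</descrip></tig>
  </langSet>
  <langSet xml:lang="nl">
    <tig><term>Europese Unie</term><termNote type="termType">fullForm</termNote>
         <descrip type="reliabilityCode">3</descrip></tig>
  </langSet>
</termEntry>
</body></text></martif>"""

for row in glossary_rows(io.BytesIO(SAMPLE), "en", "nl"):
    print(format_row(row))
```

Keeping synonyms in one cell rather than one line per combination is just one possible layout; a CAT tool that expects exactly one source and one target per line would need the expanded form instead.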
Like Erik, I get the message "System.OutofMemoryException" in MultiTerm Convert... I have a very powerful PC with tons of memory...
Michael Beijer United Kingdom Local time: 03:46 Member (2009) Dutch to English + ... TOPIC STARTER
Michael Beijer United Kingdom Local time: 03:46 Member (2009) Dutch to English + ... TOPIC STARTER
2nl (X) Netherlands Local time: 04:46 Thank you for making this possible! | Jul 11, 2014 |
Thanks Michael! Hans
Tamas Elek Hungary Local time: 04:46 English to Hungarian + ... Problem with memoQ | Jul 12, 2014 |
I simply cannot import the database into memoQ. I was trying to import the Hungarian - English language pair, but after a few hours of processing, it stops with the following message:
Warnings
--------------------------
Line 2, column 2: TBX is not valid against DTD. Details: No DTD found.
Error during TBX validation: '_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.. Skipping TBX validation.
General error.
TYPE: System.Xml.XmlException
MESSAGE: '_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.
SOURCE: System.Xml
CALL STACK:
at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
at System.Xml.XmlTextReaderImpl.ParseNumericCharRefInline(Int32 startPos, Boolean expand, StringBuilder internalSubsetBuilder, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseCharRefInline(Int32 startPos, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at MemoQ.Termbase.TBXImporter`1.readTbxAndGetLanguages(String tbxFilePath, XmlReaderSettings tbxSettings, Boolean collectLangCodes)
at MemoQ.Termbase.TBXImporter`1.checkDTD(Boolean validateXCS, Boolean collectLangCodesFromTBX)
at MemoQ.Termbase.TBXImporter`1.prepare()
at MemoQ.Termbase.GUI.Import.TBXLocalImporterJob.DoJob()
at MemoQ.Common.Job.JobBase.Execute(Object o)
Any idea how to resolve this issue? I tried three times, but it is always the same. Thank you in advance.
[Edited at 2014-07-12 22:06 GMT]
Michael Beijer United Kingdom Local time: 03:46 Member (2009) Dutch to English + ... TOPIC STARTER
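Regarding the memoQ error reported above ("hexadecimal value 0x03, is an invalid character"): the call stack goes through ParseNumericCharRefInline, which suggests (this is an inference, not something verified against the file) that the TBX contains a numeric character reference such as &#x3; to a control character that XML 1.0 does not allow. A possible workaround is to write a cleaned copy of the file before importing it. The Python sketch below does that line by line; it assumes the file is UTF-8, and clean_file, clean_line and the file names in the usage note are hypothetical.

```python
import re

# Numeric character references, decimal (&#3;) or hexadecimal (&#x03;).
CHAR_REF = re.compile(rb"&#(x[0-9A-Fa-f]+|[0-9]+);")
# Raw control bytes that XML 1.0 forbids (tab, LF and CR are allowed).
# In UTF-8 these byte values never occur inside multi-byte sequences.
RAW_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def is_valid_xml_char(code):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )


def _drop_bad_ref(match):
    """Keep valid character references, replace invalid ones with nothing."""
    body = match.group(1)
    code = int(body[1:], 16) if body[:1] == b"x" else int(body)
    return match.group(0) if is_valid_xml_char(code) else b""


def clean_line(line):
    """Remove invalid character references and raw control bytes from one line."""
    return RAW_CONTROL.sub(b"", CHAR_REF.sub(_drop_bad_ref, line))


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, cleaning as we go."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(clean_line(line))


print(clean_line(b"<term>Euro&#x03;pa &amp; &#233;</term>"))
```

Usage would be something like clean_file("IATE_export_25062014.tbx", "IATE_export_clean.tbx"), then importing the cleaned copy. Whether that is enough for memoQ to accept the file is untested here; the "No DTD found" warning is a separate matter.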
Michael Beijer wrote: Hi everyone, I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM. Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first): ‘Project > Properties’ (or F2), then click ‘Add’, then select TBX/MARTIFF Glossary, then ‘Next’, then ‘Add File’, select the file, then ‘Next’, and ‘Next’ again ... The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear. Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either: – a TMX – a tab-delimited UTF-8 text file, or – an Excel file
I used the 30-day trial of XBench for this... it took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... the rest were all errors... no idea why!
You should have clicked on the error! | Jul 13, 2014 |
Giovanni Guarnieri MITI, MIL wrote: I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!
Then it would probably tell you. Things like duplicates and missing data are common, especially for a conversion like this, where it's quite possible you would have source info with no corresponding target info, for example. Regards Paul
SDL Support wrote: Giovanni Guarnieri MITI, MIL wrote: I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why! Then it would probably tell you. Things like duplicates and missing data are common, especially for a conversion like this where it's quite possible you would have source info with no corresponding target info for example. Regards Paul
I don't remember seeing any explanation... only the number of entries imported out of the total and the number of entries not imported... maybe I didn't look properly...
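One way to check the "duplicates and missing data" explanation without relying on the import dialog would be to count those cases in the tab-delimited export itself. The Python sketch below is only a rough illustration: it assumes one source/target pair per line with the source in the first column and the target in the second, and it does not claim to reproduce Studio's actual import rules. The function name tally and the sample lines are made up.

```python
from collections import Counter


def tally(lines):
    """Count unique pairs, exact duplicate pairs and rows with an empty side."""
    counts = Counter()
    seen = set()
    for line in lines:
        cols = line.rstrip("\r\n").split("\t")
        src = cols[0].strip()
        tgt = cols[1].strip() if len(cols) > 1 else ""
        if not src or not tgt:
            counts["missing source or target"] += 1
        elif (src, tgt) in seen:
            counts["duplicate pair"] += 1
        else:
            seen.add((src, tgt))
            counts["unique pair"] += 1
    return counts


# Made-up lines standing in for an exported glossary.
sample = ["aid\taiuto\n", "aid\taiuto\n", "levy\t\n", "\tprelievo\n", "levy\tprelievo\n"]
for label, number in sorted(tally(sample).items()):
    print(f"{label}: {number}")
```

For a real export you would pass an open file instead of the list, e.g. open("en-it.txt", encoding="utf-8"), where the file name is again just a placeholder. If the counts come close to the number of rejected entries, that would support the explanation given above; if not, the cause lies elsewhere.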