(Part of) the IATE database can now be downloaded as a massive TBX! (Translator resources)


Thread poster: Michael Beijer

Michael Beijer

United Kingdom
Local time: 03:46
Member (2009)
Dutch to English
+ ...

Jul 10, 2014

‘Download IATE

IATE is a living database, i.e. translators and terminologists are continuously updating its content. In 2013, almost 97 000 new terms were added and 158 000 existing terms were modified. These changes were also reviewed and validated. Using the IATE search interface (http://iate.europa.eu/) thus ensures that you are accessing the most complete and up-to-date data. However, in order to cater for specific needs, you can also download a copy of some of the data contained in IATE.

The download file contains about 8 million terms in 24 official EU languages. It is provided in TermBase eXchange (TBX) format. For further details see: TBXcoreStructV02.dtd, TBXXCS.xcs, tbxxcsdtd.dtd.
The size of the uncompressed file is about 2.2 gigabytes.
For information on the data structure and the data categories included in the download file, please see: IATE Data fields explained
You can download the file by clicking on the link below.
IATE_download_25062014.zip (Publication date: 25/06/2014)

Statistics: The download file contains 1.3 million concepts.’

--------

A quick look at nl-en, and I count over 450,000 entries! And all of them reviewed and validated! Sadly, the definitions are not present, but that is understandable as the multilingual TBX is already 2GB!

-----> http://iate.europa.eu/tbxPageDownload.do

Noe Tessmann

Local time: 04:46
English to German
+ ...

Another massive one

Jul 11, 2014

Dear Michael,

Thanks for pointing me to that massive piece of terminology, but it seems indigestible for my PC. I have 6 GB free on my C drive and 25 GB on my D drive, but when I try to import the TBX file I immediately get an error message saying there's not enough free space left.
MQ still tries to load the file in the background, but after an hour with no response I'll have to cancel the process.

Is there a way to pre-extract only the language pairs I am really interested in, to lighten the load on MemoQ?

Have a nice week-end

Noe

PS: I have now put the TBX file on my USB stick, and it seems that some sort of import is going on in the background.

[Edited at 2014-07-11 16:12 GMT]

Erik Freitag

Germany
Local time: 04:46
Member (2006)
Dutch to German
+ ...

No luck with MultiTerm either

Jul 11, 2014

Dear Michael,

Thanks for the link! I have tried to import the tbx into MultiTerm, but the process is aborted within a couple of seconds with the message "System.OutofMemoryException". I'm using a Win7 PC, 64bit, with 16GB of RAM.

Seeing that MemoQ fails here as well, maybe something's wrong with the tbx?

Has anyone been able to successfully import the tbx?

Kind regards,
Erik

FarkasAndras

Local time: 04:46
English to Hungarian
+ ...

Nice

Jul 11, 2014

Thanks, Michael. I will look into converting it to some more palatable format / extracting language pairs. It shouldn't be hard to generate a basic tabbed glossary, but there is quite a lot of metadata in there and some of it may be worth conserving. Doing that is trickier of course.

The download fails for me every time... it's supposed to be a 113 MB file, but the download stops early and I'm stuck with a partial file. Someone rehost it in dropbox or something, please.
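
A minimal sketch of the basic tabbed glossary mentioned above, for anyone who would rather not push the whole 2.2 GB file through a CAT tool. It streams the TBX with Python's standard-library `iterparse`, so the file never has to fit in memory, and keeps only the first term per language in each entry (synonyms and metadata are dropped). The element names (`termEntry`, `langSet`, `term`) follow the TBX core structure and are an assumption about how the IATE file is laid out; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pairs(source, src_lang, tgt_lang):
    """Yield (source_term, target_term) for entries containing both languages.

    `source` is a file path or a binary file object. Only the first non-empty
    <term> of each language is kept, so synonyms are ignored in this sketch.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "termEntry":  # assumed TBX element name
            continue
        first = {}
        for lang_set in elem.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            term = lang_set.find(".//term")
            if lang not in (src_lang, tgt_lang) or lang in first or term is None:
                continue
            # Collapse tabs and line breaks so each pair fits on one line.
            text = " ".join("".join(term.itertext()).split())
            if text:
                first[lang] = text
        if src_lang in first and tgt_lang in first:
            yield first[src_lang], first[tgt_lang]
        # Drop the entry's children to keep memory use modest; the emptied
        # <termEntry> shell itself remains attached to its parent.
        elem.clear()


def write_tsv(pairs, out):
    """Write one 'source<TAB>target' line per pair and return the count."""
    count = 0
    for src, tgt in pairs:
        out.write(f"{src}\t{tgt}\n")
        count += 1
    return count


# Usage (TBX file name as posted in this thread; output name is arbitrary):
# with open("en-de.txt", "w", encoding="utf-8") as out:
#     write_tsv(iter_pairs("IATE_export_25062014.tbx", "en", "de"), out)
```

Entries that lack either language are skipped, which roughly corresponds to exporting with ‘Include segments even if the source or target is missing’ switched off.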


Michael Beijer

United Kingdom
Local time: 03:46
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

Xbench is working fine here…

Jul 11, 2014

Hi everyone,

I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM.

Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first):

‘Project > Properties’ (or F2),
then click ‘Add’,
then select TBX/MARTIFF Glossary,
then ‘Next’,
then ‘Add File’,
select the file,
then ‘Next’,
and ‘Next’ again ...

The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear.

Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either:

– a TMX
– a tab-delimited UTF-8 text file, or
– an Excel file

I am converting a number of languages for colleagues over on the CafeTranslators mailing list, but it might be better if András did it, as he has much more experience with data from such large multilingual projects.

Michael

See also: https://groups.google.com/forum/?fromgroups=#!topic/cafetranslators/lAopgfpC1Sw

PS: here is the original TBX I downloaded from the site (zipped and unzipped):

https://www.dropbox.com/s/ck67kppuis7e050/IATE_download_25062014.zip (113.38MB)
https://www.dropbox.com/s/zv5aavl0baq316h/IATE_export_25062014.tbx (2,117.09MB)

As tab del (created with ‘Include segments even if the source or target is missing’ OFF):

en-de: https://www.dropbox.com/s/vyvx9lnmkiboemf/en-de.txt (529,778 entries)
de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933-entries)

src: Download IATE, European Union, 2014.

[Edited at 2014-07-11 13:04 GMT]

FarkasAndras

Local time: 04:46
English to Hungarian
+ ...

Xbench

Jul 11, 2014

I'm not sure how much of the data and the data structure xbench conserves. The description says there are synonyms in there (i.e. two English terms in the same entry). I expect that xbench will discard all but the first. Ditto for subject domains, reliability indexes etc. It's a viable option but it shouldn't be too hard to do it better. I've had a look inside and there are quite a few acronyms (labeled as 'abbreviations') + full expressions entered as synonyms. It's worth a bit of extra work to conserve those.
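
To make the idea concrete, here is a rough sketch of how synonyms and a few metadata fields could be kept: every source term in an entry is paired with every target term, and each row carries the term types, the target's reliability code and the entry's subject field. The layout it relies on (`termNote type="termType"`, `descrip type="reliabilityCode"`, `descrip type="subjectField"`, term groups directly under `langSet`) is an assumption based on the TBX core structure and the field names mentioned in this thread, and would need checking against the IATE data fields description. All function names are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def typed_text(parent, tag, type_name):
    """Text of the first <tag type="type_name"> at or below parent, else ''."""
    for node in parent.iter(tag):
        if node.get("type") == type_name:
            return (node.text or "").strip()
    return ""


def terms_by_lang(entry):
    """Map language code -> list of dicts, one per term (synonyms included)."""
    result = {}
    for lang_set in entry.iter("langSet"):
        lang = lang_set.get(XML_LANG)
        for group in lang_set:  # assumed: one <tig> or <ntig> per term
            term = group.find(".//term")
            text = "".join(term.itertext()).strip() if term is not None else ""
            if not text:
                continue
            result.setdefault(lang, []).append({
                "term": text,
                "type": typed_text(group, "termNote", "termType"),
                "reliability": typed_text(group, "descrip", "reliabilityCode"),
            })
    return result


def entry_rows(entry, src_lang, tgt_lang):
    """Every source/target combination in one entry, with its metadata."""
    subject = typed_text(entry, "descrip", "subjectField")
    by_lang = terms_by_lang(entry)
    combos = itertools.product(by_lang.get(src_lang, []), by_lang.get(tgt_lang, []))
    return [
        (src["term"], tgt["term"], src["type"], tgt["type"], tgt["reliability"], subject)
        for src, tgt in combos
    ]


def iter_rows(source, src_lang, tgt_lang):
    """Stream a TBX file (path or binary file object) and yield rows."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "termEntry":
            yield from entry_rows(elem, src_lang, tgt_lang)
            elem.clear()  # release the entry's children once processed
```

Pairing every synonym with every target term inflates the row count, but it means an acronym and its full form both end up searchable in a flat glossary.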

Giovanni Guarnieri MITI, MIL

United Kingdom
Local time: 03:46
Member (2004)
English to Italian

yep...

Jul 11, 2014

Like Erik, I get the message "System.OutofMemoryException" in Multiterm Convert...

I have a very powerful PC with tons of memory...

Michael Beijer

United Kingdom
Local time: 03:46
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

@András:

Jul 11, 2014

FarkasAndras wrote:

I'm not sure how much of the data and the data structure xbench conserves. The description says there are synonyms in there (i.e. two English terms in the same entry). I expect that xbench will discard all but the first. Ditto for subject domains, reliability indexes etc. It's a viable option but it shouldn't be too hard to do it better. I've had a look inside and there are quite a few acronyms (labeled as 'abbreviations') + full expressions entered as synonyms. It's worth a bit of extra work to conserve those.

[Edited at 2014-07-11 12:33 GMT]

I seem to have managed to conserve the ‘reliabilityCodes’ and the ‘subjectField’ (numbers), but no synonyms or acronyms (using Xbench). I'll have to have a look at the info on the data structure when I have a moment:

http://iate.europa.eu/tbx/IATE%20Data%20Fields%20Explaind.htm
http://www.ttt.org/oscarstandards/tbx/TBXcoreStructV02.dtd
http://iate.europa.eu/downloadXcs.do

• en-de: https://www.dropbox.com/s/vyvx9lnmkiboemf/en-de.txt (529,778 entries)
• de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933-entries)
• nl-en https://www.dropbox.com/s/nmznnfotyuzl1tl/IATE_nl-en-(401,625-entries).txt (401,625-entries)

Michael

[Edited at 2014-07-11 13:26 GMT]

Michael Beijer

United Kingdom
Local time: 03:46
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

Heartsome?

Jul 11, 2014

I wonder if any of the now open-source Heartsome tools can handle this TBX better?

Michael

http://www.heartsome.net/en-US/hsde.html
http://www.heartsome.net/en-US/downloads.html

2nl (X)

Netherlands
Local time: 04:46

Thank you for making this possible!

Jul 11, 2014

Michael Beijer wrote:

• de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933-entries)

Thanks Michael!

Hans

Tamas Elek

Hungary
Local time: 04:46
English to Hungarian
+ ...

Problem with memoQ

Jul 12, 2014

I simply cannot import the database into memoQ. I was trying to import the Hungarian - English language pair, but after a few hours of processing, it stops with the following message:

Warnings
--------------------------
Line 2, column 2: TBX is not valid against DTD. Details: No DTD found.
Error during TBX validation: '_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.. Skipping TBX validation.

General error.
TYPE:
System.Xml.XmlException

MESSAGE:
'_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.

SOURCE:
System.Xml

CALL STACK:
at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
at System.Xml.XmlTextReaderImpl.ParseNumericCharRefInline(Int32 startPos, Boolean expand, StringBuilder internalSubsetBuilder, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseCharRefInline(Int32 startPos, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at MemoQ.Termbase.TBXImporter`1.readTbxAndGetLanguages(String tbxFilePath, XmlReaderSettings tbxSettings, Boolean collectLangCodes)
at MemoQ.Termbase.TBXImporter`1.checkDTD(Boolean validateXCS, Boolean collectLangCodesFromTBX)
at MemoQ.Termbase.TBXImporter`1.prepare()
at MemoQ.Termbase.GUI.Import.TBXLocalImporterJob.DoJob()
at MemoQ.Common.Job.JobBase.Execute(Object o)

Any idea how to resolve this issue? I tried three times, but it is always the same.

Thank you in advance.

[Edited at 2014-07-12 22:06 GMT]
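
The stack trace above goes through `ParseNumericCharRefInline`, which suggests the offending character is written as a numeric character reference (something like `&#x3;`) rather than as a raw byte. That is an inference from the trace, not something verified against the file. A minimal sketch, assuming the TBX is UTF-8: copy the file line by line, drop character references and raw control characters that XML 1.0 does not allow, then try importing the cleaned copy. The function names and the output file name are invented for illustration.

```python
import re

# Numeric character references such as &#3; or &#x3;
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")
# Raw C0 control characters other than tab, line feed and carriage return.
RAW_CONTROL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def is_valid_xml_char(code_point):
    """True if the code point is allowed by the XML 1.0 'Char' production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def clean_line(line):
    """Remove invalid character references and raw control characters."""
    def keep_if_valid(match):
        body = match.group(1)
        code_point = int(body[1:], 16) if body.startswith("x") else int(body)
        return match.group(0) if is_valid_xml_char(code_point) else ""
    return RAW_CONTROL.sub("", CHAR_REF.sub(keep_if_valid, line))


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, cleaning each line; return lines changed."""
    changed = 0
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            cleaned = clean_line(line)
            changed += cleaned != line
            dst.write(cleaned)
    return changed


# Usage (input name as posted in this thread; output name is arbitrary):
# clean_file("IATE_export_25062014.tbx", "IATE_export_cleaned.tbx")
```

This only addresses the invalid-character error; the "No DTD found" warning is a separate matter and is not touched by the sketch.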

Michael Beijer

United Kingdom
Local time: 03:46
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

multifarious.filkin.com

Jul 13, 2014

Interesting post on Paul Filkin's blog: http://multifarious.filkin.com/2014/07/13/what-a-whopper/

He has found a way to get the data into MultiTerm (apparently with all the metadata intact).

Michael

Giovanni Guarnieri MITI, MIL

United Kingdom
Local time: 03:46
Member (2004)
English to Italian

done!

Jul 13, 2014

Michael Beijer wrote:

Hi everyone,

I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM.

Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first):

‘Project > Properties’ (or F2),
then click ‘Add’,
then select TBX/MARTIFF Glossary,
then ‘Next’,
then ‘Add File’,
select the file,
then ‘Next’,
and ‘Next’ again ...

The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear.

Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either:

– a TMX
– a tab-delimited UTF-8 text file, or
– an Excel file

I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!

RWS Community
United Kingdom
Local time: 04:46
English

You should have clicked on the error!

Jul 13, 2014

Giovanni Guarnieri MITI, MIL wrote:

I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!

Then it would probably tell you. Things like duplicates and missing data are common, especially for a conversion like this where it's quite possible you would have source info with no corresponding target info for example.

Regards

Paul
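
One way to check for the causes mentioned above before importing: a minimal sketch that counts rows with a missing side and exact duplicate pairs in a tab-delimited export (one of the Xbench export formats listed earlier in the thread). It assumes the source term is in the first column and the target term in the second, and it does not reproduce whatever rules Studio applies on import. The function name and the file name in the usage comment are made up for illustration.

```python
from collections import Counter


def audit_rows(lines):
    """Count usable, incomplete and duplicate source/target rows."""
    stats = Counter(total=0, usable=0, missing_side=0, duplicate=0)
    seen = set()
    for line in lines:
        columns = line.rstrip("\r\n").split("\t")
        src = columns[0].strip()
        tgt = columns[1].strip() if len(columns) > 1 else ""
        stats["total"] += 1
        if not src or not tgt:
            stats["missing_side"] += 1   # source or target is empty
        elif (src, tgt) in seen:
            stats["duplicate"] += 1      # exact repeat of an earlier pair
        else:
            seen.add((src, tgt))
            stats["usable"] += 1
    return stats


# Usage (hypothetical file name):
# with open("en-it.txt", encoding="utf-8") as export:
#     print(audit_rows(export))
```

If the duplicate and missing-side counts come close to the number of rejected entries, that would point to the explanation given above; if not, the cause lies elsewhere.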

Giovanni Guarnieri MITI, MIL

United Kingdom
Local time: 03:46
Member (2004)
English to Italian

I did!

Jul 13, 2014

SDL Support wrote:

I don't remember seeing any explanation... only the number of entries imported out of the total and the number of entries not imported... maybe I didn't look properly...
