multifarious
Just a random collection of things I thought it might be useful to know….
So how many words do you think it is?
It’s not unusual for people to see the word count in a translation tool, compare it to the word count in their authoring tool (usually MSWord..!) and then wonder why there are so many differences. Quite often I get sent documents and am asked to explain the differences, which I usually put down to a simple explanation: Word does a simple word count, whereas a translation tool is designed to try and reflect the effort of the translator. Ultimately I think you need to be fair with this. Studio even separates out the placeables and numbers so that the real effort becomes apparent and a fair way of measuring and charging for the work is achieved. How successful this is often depends on who’s asking and exactly what the source material is..!
To get an idea where some of the differences lie and to see exactly how Studio does its count I have consolidated a few things I’ve shared with others in the past (thank you for asking, Kevin Lossner), stolen a few things from the work of others (thank you for your great article Tuomas Kostiainen) and have cobbled a few things together myself. So I created a word document that looks like this:
I then analysed this in Studio, Trados 2007 and MSWord. The results were as follows:
There’s not a lot of difference here, is there? The truth is that more often than not a quick check will reveal something similar and no-one’s worried. But the text above contains things that do make a difference, and if the text you are translating contains a lot of something that affects the count then this is when the conversations about who’s right begin.
But before we investigate the differences let’s take a look at the Studio count itself. Studio counts tags and placeables separately, whereas Trados 2007 counts some placeables and tags but doesn’t distinguish between them… and sometimes counts them incorrectly. So in this file, if you take all the separate segments and add them up, the total number of placeables is seven and not the four shown here… an important contributory factor when comparing the analysis between these tools, but not something worth investigating because it will never be fixed..! What Studio is counting can be explained using the first four segments of this example:
Taking it segment by segment the analysis is like this:
#1 : 5 words
#2 : 7 words, with one placeable that is counted in the wordcount
#3 : 7 words, with two tags. The tags are also placeables and are not included in the wordcount
#4 : 9 words, with four placeables each of which is counted as a single word
Segment #4 is the active segment showing the blue underlined placeables that are recognised by Studio. So the analysis of this part alone is reported as follows:
The conclusion being:
- all placeables are counted as words unless they are also tags
- tags themselves can be identified for any manual adjustment to the overall rate for tag handling
- numbers themselves cannot be identified from the analysis (subtract the tags and you are left with 5 placeables… one number, two dates and two variables)
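The arithmetic behind these conclusions can be checked with a minimal sketch. The figures are the per-segment values listed above; treating segment #1 as having no placeables or tags is my assumption, since the breakdown only gives its word count, and the variable names are purely illustrative.

```python
# Per-segment figures from the breakdown above: (words, placeables, tags).
# Tags are also placeables, so they are included in the placeable figure.
# Segment 1 is assumed to contain no placeables or tags.
segments = {
    1: (5, 0, 0),
    2: (7, 1, 0),
    3: (7, 2, 2),
    4: (9, 4, 0),
}

total_words = sum(words for words, _, _ in segments.values())
total_placeables = sum(placeables for _, placeables, _ in segments.values())
total_tags = sum(tags for _, _, tags in segments.values())

# Placeables that are not tags are the ones that count as words.
non_tag_placeables = total_placeables - total_tags

print(total_words, total_placeables, total_tags, non_tag_placeables)
# prints: 28 7 2 5
```

The seven placeables match the total mentioned earlier, and subtracting the two tags leaves the five placeables that cannot be broken down any further from the analysis alone.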
A complication worth noting is that if the placeables are formatted in a way that Studio does not recognise then this also affects the analysis. So for example the same file analysed without these being recognised could generate the following:
This is based on the long date format no longer being recognised and therefore the wordcount has increased by three:
This is actually the same as we see in MSWord now because Word doesn’t see “placeables” and counts this segment as 12 words. So Studio sees this one as 12 words with 5 placeables. You can see how Studio expects to see the formatting using the National Language Support (NLS) API Reference from Microsoft. You change the operating system to suit the one you are using and the language code details are changed accordingly. Find the language you’re interested in and then click on the link like this:
So in addition to understanding the importance of the source formatting when analysing a file we can also see the first difference between the tools.
Numbers and Dates
Today is Monday, August 8, 2011, aka 08/08/2011 in US and UK.
- (9 words, 4 placeables) Studio counts numbers and full dates when they are recognised
- (9 words, 1 placeable) Trados 2007 does not count numbers and only recognises some date formats (the long date above is not recognised)
- (12 words) MSWord counts numbers and recognises some dates when counting (the long date above is not recognised)
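To see where the three-word gap between 12 and 9 comes from, here is a rough sketch, not Studio’s actual algorithm. It assumes a plain whitespace count stands in for MSWord, and it uses a hypothetical regular expression (`LONG_DATE`) for one English long date format to stand in for placeable recognition.

```python
import re

SENTENCE = "Today is Monday, August 8, 2011, aka 08/08/2011 in US and UK."

# Hypothetical pattern for a long date such as "Monday, August 8, 2011".
# Real tools take their date formats from locale data, not from one regex.
LONG_DATE = re.compile(r"\b[A-Z][a-z]+day, [A-Z][a-z]+ \d{1,2}, \d{4}\b")


def whitespace_count(text: str) -> int:
    """Count every whitespace-separated token as one word."""
    return len(text.split())


def count_with_long_date(text: str) -> int:
    """Collapse a recognised long date into a single token, then count."""
    collapsed = LONG_DATE.sub("DATE", text)
    return len(collapsed.split())


print(whitespace_count(SENTENCE))      # 12
print(count_with_long_date(SENTENCE))  # 9
```

When the long date is recognised its four tokens collapse into one, which accounts for the difference of three.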
Hyphenated Words
He has a devil-may-care attitude when it comes to hyphenated words
- (11 words) Studio counts hyphenated words as one word
- (13 words) Trados counts the individual words separated by the hyphens
- (11 words) MSWord counts hyphenated words as one word
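The two behaviours can be reproduced with a small sketch. The function names are mine, and this only imitates the counting rule described above rather than any tool’s real tokeniser.

```python
import re

SENTENCE = "He has a devil-may-care attitude when it comes to hyphenated words"


def count_hyphen_joined(text: str) -> int:
    """Split on whitespace only, so a hyphenated word stays as one word."""
    return len(text.split())


def count_hyphen_split(text: str) -> int:
    """Split on whitespace and hyphens, so each part counts separately."""
    return len([part for part in re.split(r"[\s-]+", text) if part])


print(count_hyphen_joined(SENTENCE))  # 11
print(count_hyphen_split(SENTENCE))   # 13
```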
Forward slash and back slash
Using the solidus for he/she/it is often discouraged, except in this case. c:\Users\pfilkin\Documents\Studio 2011\Projects87 - Joomla ini files \gd-GB\en-AU.tpl_atomic.sys.ini.sdlxliff is handled similarly.
- (31 words, 2 placeables) Studio counts words separated by either “slash” as separate words
- (26 words) Trados counts words separated by either “slash” as separate words but doesn’t count the numbers (it’s also inconsistent in its approach so justifying the 26 is an interesting challenge… feel free to post your answers below..!)
- (21 words) MSWord treats words separated by either “slash” as a single word
Dotted lines (dashed really…) and underscores
Dotted lines count --------- as a word per line but solid underscore lines are ignored ________.
- (16 words) Studio counts both the dotted line and the underscore line as a single word
- (15 words) Trados counts the underscore line as a word but ignores the dotted line (so the opposite of the text above… my initial assumption)
- (16 words) MSWord treats both the dotted line and the underscore line as a single word
You do need to pay attention to this though, because if the lines are on their own, and not within a sentence as shown here, then Trados won’t count them at all. Studio and MSWord will treat them consistently, also counting individual dashes or underscores separated by spaces as separate words.
Hyperlinks
You can find anything with { HYPERLINK "http://www.google.co.uk" }.
- (7 words, 3 placeables, 2 tags) Studio treats a hyperlink as a single placeable but also separates the hyperlink from the link text so both are counted
- (8 words, 2 placeables) Trados also counts the hyperlink, but as 2 words (in a DOCX)
- (6 words) MSWord treats the entire hyperlink as a single word
One of the differences with hyperlinks is that Studio will separate out the link from the link text, making it obvious and also allowing the translation of the link itself if required. The two tags and the link in #10 account for the three placeables:
Trados does the same thing but splits the hyperlink itself into two words because of the colon and because the entire link is not handled as a placeable as in Studio. This can vary with the file type as well because DOC and DOCX are treated differently in Trados. The same text as a DOC returns this analysis because of the inconsistent way Trados 2007 treats hyperlinks between file types:
- (7 words, 3 placeables, 2 tags) Studio counts as before
- (6 words, 1 placeable) Trados doesn’t count the hyperlink at all
- (6 words) MSWord counts as before
Numbered lists
1. If you put 2. things into a numbered list 3. the numbers are excluded 4. unless you do them manually 5. like these last two
- (23 words, 2 placeables) Studio moves the correctly formatted numbers (automatic numbering in MSWord) outside the segment and only counts the 4. and 5. because these are placeables in the segment.
- (21 words) Trados does a similar thing but then ignores the numbers that are in the text as part of the count.
- (26 words) MSWord treats the manual and automatic numbering in the same way and they are all counted.
In Studio these actually look like this:
So I think Studio reflects the effort required more accurately than Trados or MSWord.
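The three counts for the numbered list can be reproduced with a short sketch. It assumes, as the example text itself says, that 1., 2. and 3. are automatic list numbers and that 4. and 5. were typed by hand; the names used are illustrative only.

```python
import re

TEXT = (
    "1. If you put 2. things into a numbered list 3. the numbers are excluded "
    "4. unless you do them manually 5. like these last two"
)

# Assumption: these three were created with automatic numbering in MSWord.
AUTO_NUMBERS = {"1.", "2.", "3."}
LIST_NUMBER = re.compile(r"\d+\.")

tokens = TEXT.split()

# Everything counted, automatic or manual.
count_everything = len(tokens)

# Automatic numbers moved outside the segment; manual ones still counted.
count_without_auto = sum(1 for t in tokens if t not in AUTO_NUMBERS)

# No list numbers counted at all.
count_without_numbers = sum(1 for t in tokens if not LIST_NUMBER.fullmatch(t))

print(count_everything, count_without_auto, count_without_numbers)
# prints: 26 23 21
```

These line up with the 26, 23 and 21 reported for MSWord, Studio and Trados respectively.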
Numbers and units
50% is written correctly and 50 % is not.
- (8 words, 2 placeables) Studio counts the number and the percentage as a single placeable
- (8 words) Trados ignores the number on its own and counts 50% as one word
- (9 words) MSWord treats numbers and the separated percentage as separate words
In this example 50% and 50 % are both considered correct ways to write percentages with the language pair I have used. This may not always be the case and sometimes the analysis in Studio will vary if the formatting of the numbers does not match that which is expected.
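A quick sketch shows why the separated percentage adds a word in a plain whitespace count. Joining a number to a following % sign is my own simplification of “number and unit as one placeable”, not how any of the tools actually implement it.

```python
import re

SENTENCE = "50% is written correctly and 50 % is not."


def whitespace_count(text: str) -> int:
    """Count every whitespace-separated token, so '50 %' is two words."""
    return len(text.split())


def count_number_with_unit(text: str) -> int:
    """Join a digit and a following separated % sign before counting."""
    joined = re.sub(r"(\d)\s+%", r"\1%", text)
    return len(joined.split())


print(whitespace_count(SENTENCE))        # 9
print(count_number_with_unit(SENTENCE))  # 8
```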
Hidden text… using MSWord hidden text feature
If I hide some of the text the count will be funny.
- (9 words, 1 placeable, 1 tag) Studio ignores the hidden text but does display it as a tag which doesn’t contribute to the word count total
- (9 words, 1 placeable) Trados behaves in a similar way but doesn’t distinguish between a tag and another form of placeable
- (9 words) MSWord doesn’t count hidden text at all
The red text is hidden using the “Hidden text” font feature in MSWord rather than the special non-translatable styles that can be applied to text in Word.
Chemical names
Chemical names are tricky, like (2R)-2-methylsulfanyl-3- hydroxybutanedioate
- (8 words, 1 placeable) Studio counts -2 as a placeable and so separates the chemical name itself into three words for the count due to the hyphens
- (10 words) Trados counts the hyphenated words separately and because of the number recognition breaks up the chemical name around the 3… I think
- (6 words) MSWord treats the chemical name as one word consistently handling hyphenated words as a single word
This is actually a good example of where none of the applications does a good job, because the effort involved in writing these things is not recognised at all. If they are added to the variable list or a termbase then the effort will be much reduced, but for new text this is definitely a problem area for all applications.
Writing styles
Using an ellipsis … causes different counts…depending on the style . . . that you use.
- (12 words) Studio ignores the ellipsis unless there are no spaces, in which case it treats this as a hyphenated word and treats the word…word as one
- (13 words) Trados counts all the words separately, ignoring the ellipsis.
- (16 words) MSWord treats the first ellipsis as a word and the last three dots of the final ellipsis as separate words… it treats the middle ellipsis as a hyphenated word, as Studio does.
There is no single rule for using an ellipsis… as far as I know. So some texts ask for a space before and after the dots, others say there should be none and I found one guide that asked for a space before and after each dot. I use a space after the dots only (looks like my bad!)… but all these differences are handled in different ways and also lead to analysis inconsistencies.
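The three ellipsis counts can be imitated with a few lines. This is only a sketch of the rules as I have described them, with illustrative names, and the ellipsis character is written as an escape so the example does not depend on file encoding.

```python
ELLIPSIS = "\u2026"  # the single-character ellipsis

SENTENCE = (
    f"Using an ellipsis {ELLIPSIS} causes different counts{ELLIPSIS}depending "
    "on the style . . . that you use."
)


def is_only_dots(token: str) -> bool:
    """True for tokens made up entirely of full stops or ellipsis characters."""
    return all(ch in "." + ELLIPSIS for ch in token)


tokens = SENTENCE.split()

# Every whitespace-separated token counts, including lone dots.
count_everything = len(tokens)

# Standalone ellipses are ignored; word + ellipsis + word stays as one word.
real_words = [t for t in tokens if not is_only_dots(t)]
count_ignoring_ellipsis = len(real_words)

# As above, but words joined by an ellipsis are counted separately.
count_splitting_ellipsis = sum(
    len([part for part in t.split(ELLIPSIS) if part]) for t in real_words
)

print(count_everything, count_ignoring_ellipsis, count_splitting_ellipsis)
# prints: 16 12 13
```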
The other interesting thing about this is segmentation… in Studio I see this:
So the number of segments may vary too depending on the style used and as a result the effort required to merge the segments as you work. They are not paragraph segments so they can be merged, and you could also create a segmentation rule to handle this automatically, but it’s still a good example of how important the authoring styles can be to productivity.
Dotted lines
……………………………………………………………………
- (0 words) Studio only counts the characters and not the words. However it still displays this as a segment that you have to handle.
- (0 words) Trados handles the count the same way but doesn’t make this a translatable segment.
- (1 word) MSWord treats this as a single word
Conclusion
After writing this, and after looking at various examples of texts I have received to investigate since the launch of Studio, it is clear there is no single simple answer to why the counts differ. Each text needs to be considered on its own merits and often the reasons for the differences are clear. These differences are not only there when considering Studio, Trados 2007 and MSWord… they will also be there when comparing the counts between other translation tools.
All in all I think I agree with the conclusion drawn by Tuomas Kostiainen that Studio makes a good attempt to be fair and to reflect the effort made by the translator in having to handle the work. Even when placeables are not recognised the analysis is reflective of the effort. Studio is also consistent between file types and manually verifying the counts on a segment by segment basis is simple compared to Trados.
Hopefully this explanation of the Studio analysis, and the consolidation of the differences in counts between tools, will help to explain why things are different the next time you are asked why. It may not help to resolve how you are actually paid, and this is where the difficulties come from… and is another, less technical, reason why vendors often prefer translators to use the same tools throughout the supply chain. There isn’t an easy answer to a dispute over counts, only fair and sensible compromise… although you could consider an independent tool for counting so that irrespective of the tools used for translation the counts are always consistent. I don’t have an opinion on the merits of these tools but I do know some translators who use them… tools like PractiCount for example… but even these tools are probably not as comprehensive in terms of file support as your translation tool is.
6 comments
The latter would be part of a bigger solution but GoAnalyze is great.
I find that the biggest difference in word counts between Studio and Trados 2007 is in files with number-only segments (Studio counts them and Trados 2007 doesn’t). I know of one agency that uses the Trados 2007 log report to calculate job price and then sends me a Studio package. In one case this meant a difference of 15,000 words in a 90,000 word project. The excuse used to be that Studio couldn’t export its analysis. With the OpenExchange Export Analysis Report and later the same feature within Studio the problem’s been solved, but people should be aware of this potentially big difference between Trados 2007 and Studio word counts.
Translating number-only segments needs time and care, even with auto-propagation enabled. It’s good that Studio takes this into account.
Emma