I am currently building a database of English words and their hyphenations (end-of-line divisions; en-US, if it matters), and in doing so I have come across some words for which I have found contradictory hyphenations. If those words were exotic, I would not be surprised, but some of them are frequently used. For example:
Germany: Merriam-Webster - Ger-ma-ny; Hunspell (by far the most widely used spell checker and hyphenator in the open-source world, driving applications such as LibreOffice, OpenOffice, Firefox and Thunderbird) - Ger-many
freely: Merriam-Webster - free-ly; Hunspell - freely
rapid: Merriam-Webster - rap-id; Hunspell - rapid
I have read a lot of articles (most of them on this site) about hyphenation. The general consensus seems to be that we should look up the respective word and its hyphenation in authoritative sources. But what if those sources contradict each other?
Another piece of advice that was often given is simply to hyphenate between syllables. Since I am not a native English speaker, this is extremely difficult for me. While I would have got it right with Germany and freely, I would never have got it right with rapid (in my world, it would have been ra-pid).
I have always considered the Oxford English Dictionary to be the most authoritative English dictionary. Imagine my surprise when I saw that it shows neither hyphenation nor syllabication. Wiktionary does show hyphenation, but only for some words; the examples mentioned above, although very common words, are not among them, so it is of no use here.
Could somebody please tell me what I should do if two important sources, both of which can (somehow) be considered authoritative, show contradictory hyphenations? And, even more importantly, is there a reliable method to identify words that are suspect in this respect in the first place?
To explain the latter: I am currently using the Hunspell data to build my database semi-automatically; otherwise, I could not handle it. The Hunspell data is the only source I have found from which the hyphenation of a word can be obtained fairly easily.
As a second step, I would like to be able to identify and separate suspect words, which I then could look up manually in different sources (hoping that only about 5% of the words are suspect).
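One possible way to automate this second step is to reduce each hyphenated form to the set of positions at which a break is allowed, and to flag every word for which two sources yield different sets. The following Python sketch only illustrates that idea: the names `break_positions` and `find_suspects` are hypothetical, the two dictionaries are filled by hand with the examples from this question, and loading real Hunspell or Merriam-Webster data is left out.

```python
def break_positions(hyphenated):
    """Return the offsets (counted in the unhyphenated word) of all breaks.

    Assumption: every "-" marks a possible break; words that are spelled
    with a real hyphen would need separate handling.
    """
    positions = set()
    offset = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(offset)
        else:
            offset += 1
    return frozenset(positions)


def find_suspects(source_a, source_b):
    """List the words, present in both sources, whose break points differ."""
    common = source_a.keys() & source_b.keys()
    return sorted(
        word for word in common
        if break_positions(source_a[word]) != break_positions(source_b[word])
    )


# Hand-entered examples taken from this question.
hunspell = {"Germany": "Ger-many", "freely": "freely",
            "rapid": "rapid", "client": "client"}
merriam_webster = {"Germany": "Ger-ma-ny", "freely": "free-ly",
                   "rapid": "rap-id", "client": "cli-ent"}

print(find_suspects(hunspell, merriam_webster))
# ['Germany', 'client', 'freely', 'rapid']
```

Comparing positions rather than strings means that differences in notation (for example a centered dot instead of a hyphen, once normalized) would not by themselves produce a mismatch.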
EDIT 1
As a reaction to one of the comments, I now have found a word where at least 3 characters are left at each side after hyphenation, but where different "authorities" hyphenate differently:
Microsoft Word 2010 hyphenates inconceivable as in-con-ceiv-a-ble, where Merriam-Webster has in-con-ceiv-able.
Another one: Merriam-Webster says cli-ent, where Hunspell says client, i.e. does not hyphenate that word at all.
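To check which disagreements survive a "minimum characters on each side" rule, the break points of both sources can be filtered before they are compared. The Python sketch below assumes a minimum of 3 characters on each side, which is simply the figure from the comment mentioned above and not a confirmed setting of Hunspell or any other tool; the function name `filtered_breaks` is hypothetical.

```python
def filtered_breaks(hyphenated, left_min=3, right_min=3):
    """Return the break offsets that leave at least left_min characters
    before and right_min characters after the break.

    The defaults of 3 are an assumption made for illustration only.
    """
    word_length = len(hyphenated.replace("-", ""))
    breaks = set()
    offset = 0
    for ch in hyphenated:
        if ch == "-":
            if offset >= left_min and word_length - offset >= right_min:
                breaks.add(offset)
        else:
            offset += 1
    return breaks


# Pairs of contradicting hyphenations quoted in this question.
pairs = [
    ("Ger-ma-ny", "Ger-many"),
    ("free-ly", "freely"),
    ("rap-id", "rapid"),
    ("in-con-ceiv-a-ble", "in-con-ceiv-able"),
    ("cli-ent", "client"),
]

still_disagree = [
    second.replace("-", "")
    for first, second in pairs
    if filtered_breaks(first) != filtered_breaks(second)
]
print(still_disagree)  # ['inconceivable', 'client']
```

Under that assumption, the differences for Germany, freely and rapid disappear, while inconceivable and client remain contradictory.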
EDIT 2
@Hot Licks has pointed out that the dictionaries show syllable boundaries, not hyphenation points (if any). However, at least in the case of Merriam-Webster, the two appear to be the same. From their dictionary API documentation:
... (text = boldface)
HEADWORD
- This is the first bold word in an entry
- contains "syllable" break points (that is,
end-of-line hyphenation points) here indicated
by asterisks, which will translate to raised dot,
{point} in Merriam-Webster font.
- may contain superscript homograph numbers
{h,1}, {h,2}, etc., in the same font (bold)
- single word space after field
Please note the text following the second hyphen. IMHO, that means that each syllable boundary is a hyphenation point, and vice versa.
EDIT 3
I have found more precise information. From Merriam-Webster's guide to pronunciation:
Hyphens are used to separate syllables in pronunciation
transcriptions. [...]
The centered dots in boldface entry words indicate potential
end-of-line division points and not syllabication. [...] As a
result, the hyphens indicating syllable breaks and the centered
dots indicating end-of-line division often do not fall in the same
places.