Talk:Soundex
This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: Computer science, Linguistics.
Soundex may be used for indexing words
Soundex may be used for indexing words, but it was specifically designed for names, and doesn't always apply well to much else. RossPatterson 21:39, 24 September 2005 (UTC)
- Soundex was specifically designed to index names of Western European origin, and not much else. This was a huge bias on the inventors' part, since they were American, and in the 18th/19th centuries the vast majority of Americans had surnames of Western European origin. This is not necessarily the case today. PS - Can someone please demonstrate that Margaret Odell had anything to do with this patent? I don't see her listed anywhere on the patent as a patent holder and I can't find any information on her anywhere... Thx. 12.110.196.19 15:48, 5 April 2006 (UTC)
- Knuth describes Soundex in volume 3 (p. 394 in the second edition) as: "... a technique that was originally developed by Margaret K. Odell and Robert C. Russell [see U.S. Patents 1261167 (1918), 1435663 (1922)], ...", and that's good enough for me. I expect just about every reference you'll find on the web goes back to Knuth. It's certainly true that only Russell's name is on those two patents, but I see on re-reading that neither Knuth nor this article say that Odell held the patents. RossPatterson 02:00, 6 April 2006 (UTC)
- This page said Odell was a co-inventor back in the March revision. I see someone updated the page to reflect that Daitch-Mokotoff returns up to 32 separate encodings. The range of encodings, however, is actually 000000 to 999999, although many numeric combinations in this range will never be encountered because of restrictions on side-by-side duplicate digits in the rules for the algorithm. 69.116.243.218 02:20, 22 July 2006 (UTC)
C Code removal
I'm sorry Sudipta, because I do believe you were acting in perfectly good faith and put plenty of effort into your C implementation of the algorithm, but unfortunately incorporating your own work contravenes the No Original Research standard, so I removed it. I hope you understand. Fortunately there are plenty of avenues on the internet where publishing original code is positively encouraged, so maybe your implementation can find an audience there? --VinceBowdren 22:41, 22 May 2007 (UTC)
Better algorithm?
This description doesn't seem very good: at least, it's not possible to convert the steps to code one by one as they're written here. The main problem is that step 2 tells you to discard the vowels, but then step 4 refers to the original string. Is there a better algorithm somewhere? Interplanet Janet 16:18, 25 September 2007 (UTC)
- One modified algorithm may work fine for you:
- Retain the first letter of the string
- (American census version only): remove any H or W unless it is the first letter
- Assign numbers to letters (after the first) as follows:
- b, f, p, v = 1
- c, g, j, k, q, s, x, z = 2
- d, t = 3
- l = 4
- m, n = 5
- r = 6
- a, e, h, i, o, u, w, y = 0
- Any runs of 2 or more of the same digit should be replaced with a single copy.
- Remove any 0 digits
- Return the first four characters, right-padding with zeroes if there are fewer than four.
- Does that help? 67.76.205.139 (talk) 23:53, 10 January 2008 (UTC)
- This doesn't deal with the case where the second letter of the string has the same code number as the first letter (in which case the second should be ignored). JMG (talk) 02:42, 2 June 2008 (UTC)
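For anyone who wants to try the steps above, here is a minimal Python sketch of the modified algorithm, with the first-letter case handled by encoding the first letter too before collapsing runs. The function name `soundex`, the `census` flag, and the choice to return an empty string for input without letters are assumptions made for illustration; this is not taken from the article or from any reference implementation.

```python
# Letter-to-digit table from the list above; vowels, h, w and y map to "0".
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6"),
                       ("aehiouwy", "0")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name, census=True):
    """Sketch of the modified algorithm described in this thread."""
    # Keep only the letters a-z (assumption: anything else is ignored).
    letters = [c for c in name.lower() if c in CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    if census:
        # American census version: drop H and W unless first letter.
        rest = [c for c in rest if c not in "hw"]
    # Encode the first letter as well, so a following letter with the
    # same code is absorbed when runs are collapsed.
    digits = [CODES[first]] + [CODES[c] for c in rest]
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Discard the first letter's own digit and all zeros, then pad/truncate.
    tail = [d for d in collapsed[1:] if d != "0"]
    return (first.upper() + "".join(tail) + "000")[:4]


print(soundex("Ashcraft"))                 # A261
print(soundex("Pfister"))                  # P236
print(soundex("Ashcraft", census=False))   # A226
```

With `census=False` the H in Ashcraft is treated like a vowel, so the two 2s are not merged and the result is A226 instead of A261.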
Clarity and correctness of Rules section
The description of the meaning of 'h' and 'w' is not at all clear. The section also states that vowels can affect the coding, but it doesn't say in which way they do. Furthermore, one of the examples given is not correct: Ashcraft is coded A261, not A226 (see http://www.archives.gov/genealogy/census/soundex.html ). I suggest the Rules section either be rewritten, or maybe even removed completely. 79.136.60.98 (talk) 23:22, 28 April 2010 (UTC)
- I have rewritten the H & W rules according to the US Census Bureau (http://www.archives.gov/research/census/soundex.html). — Preceding unsigned comment added by Copyeditor42 (talk • contribs) 15:46, 8 February 2012 (UTC)
As noted above, the main page gives the incorrect answer for Ashcraft, which should be A261, not A226, following the rules as I understand them. Moreover, this example is used in the US Census Bureau document cited above, which explicitly says that it should produce A261 not A226. I will therefore update it on the main page.
Njr~enwiki (talk) 09:57, 24 July 2019 (UTC)
The American Soundex section seems redundant now. The rules are nearly identical, and they even use the same examples. The article would benefit by combining them, or limiting them to one or the other. Devinmcginty (talk) 21:08, 26 July 2019 (UTC)
SQL Server 2008's implementation of soundex
I noticed that soundex in SQL Server 2008 returned A226 for Ashcraft instead of A261. It appears to use the pre-1920 implementation of the Soundex algorithm, for whatever reason (perhaps it treats the 'h' like a vowel, so that it separates the two 2s instead of being skipped in the repeating-digit rule, or perhaps it has some more advanced rule that is not documented here). Or perhaps it's simply a bug.
Additional Information
It's been a few years since I've even logged in to wiki to post something, so please forgive any inadvertent lapses in wiki etiquette.
I happened to run across this term today on Slashdot (in reference to Tamerlan Tsarnaev) and jumped over here to see what it said about Soundex.
I learned about the term back in 1993 when I started as a police officer. When a user "ran" somebody on the computer system, the user might get a wanted hit on a similar name. That was taught to me as a "Soundex" and further confirmation was needed if the name wasn't an exact match. The joke was that more times than not the "matched" name wasn't even close. I never knew until today exactly how the database actually returned the name.
Anyway, I thought it might be of some interest to those of you that were maintaining this entry.
Trick414 (talk) 02:59, 27 March 2014 (UTC)Trick414
NOTE - if it's of any use - for info. [Anonymous]
The UK Home Office had an "advanced" algorithm for UK surnames and street names, which are far more idiomatic in pronunciation than American English. This consisted of a set of rules for name beginnings, name endings, and name 'centres' (i.e. a rolling pass allowed between defined 'begin' and 'end' sequences), which were applied in a defined order of passes, the idea being to remove/reduce known letter 'groups' of common pronunciation; the final pass was then a 'classic' type letter-to-number reduction. This system also had an 'exceptions' list of names which were truly idiomatic (an example: "Cholmondeley" is pronounced "Chumley"). I don't know if this system is still used in their computing systems, but it was definitely used from 1980 to the late 90s for criminal records office database(s). I guess with more immigrant/foreign language surnames around now, perhaps this system is no longer viable.
— Preceding unsigned comment added by 219.89.41.218 (talk) 21:32, 2 April 2015 (UTC)
CCA's database system's version of Soundex function ($SNDX)
CCA's database system Model-204 includes a function ($SNDX) which works a little differently and doesn't follow the 3-digit rule at all. It will return as many (or as few) characters as there continue to be consonants, but ignores W, H, Y and doubles, so Ashcraft is encoded as A22613, while Murray becomes M6 (no extra zeroes). It considers the extra length useful for database indexes. [1] Rogerclarinet (talk) 19:46, 13 November 2017 (UTC)
References
- ^ Model 204 User Language manual (various editions) $SNDX
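To make the difference concrete, here is a rough Python sketch of a variable-length encoding that reproduces the two examples given above (A22613 and M6). It is only a guess at the behaviour: it reads "doubles" as doubled letters (the "rr" in Murray) rather than repeated digits, since Ashcraft keeps both 2s. The name `sndx_like` and every rule in it are assumptions; none of it is taken from the Model 204 manual.

```python
# Consonant codes as in classic Soundex; vowels, W, H and Y have no code.
CONSONANT_CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CONSONANT_CODES[ch] = digit


def sndx_like(name):
    """Hypothetical variable-length variant: no padding, no truncation."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = letters[0]
    for c in letters[1:]:
        # Skip a letter that repeats the one before it (assumed meaning
        # of "doubles"), and skip anything without a consonant code.
        if c != prev and c in CONSONANT_CODES:
            out += CONSONANT_CODES[c]
        prev = c
    return out


print(sndx_like("Ashcraft"))  # A22613
print(sndx_like("Murray"))    # M6
```

Other names may well come out differently in the real function, so treat this as an illustration of the "no 3-digit rule" idea only.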
homophone?
Is the goal really to match homophones? A homophone, according to its Wikipedia entry, is "..a word that is pronounced the same (to varying extent) as another word but differs in meaning". Isn't soundex typically used to match words that are pronounced the same and have the same meaning? — Preceding unsigned comment added by JonasRosenqvist (talk • contribs) 08:32, 5 June 2018 (UTC)