Zed Gecko
Joined: 23 Sep 2006 Posts: 120
Posted: Sun Aug 12, 2007 5:14 pm Post subject: gender-verification by forename (cmd-line-tool & db)
Recently published by a German magazine: a tool that decides whether a first name is female or male.
ftp://ftp.heise.de/pub/ct/listings/0717-182.zip
www.heise.de/ct soft-link 0717182
The zip contains a command-line tool, the C source, and the text file with the match data for about 40,000 names.
| Quote: |
Overview of the program "gender" by Jörg MICHAEL
The program "gender.c" is a program for determining the gender of a given
first name.
List of files:
a) gen_ext.h (contains macros and prototypes; may be changed)
b) umlaut.h (contains lists of umlauts)
c) gender.c (this is the "workhorse" of the program)
d) nam_dict.txt (dictionary file containing first names)
The file "nam_dict.txt", which contains a list of first names, uses the
char set "iso8859-1".
If you want to use "gender.c" as a library, delete the line
"#define GENDER_EXECUTABLE" from the file "gen_ext.h".
========================================================================
The dictionary file "nam_dict.txt"
The program "gender.c" uses the dictionary file "nam_dict.txt" as a data
source. This file contains a list of more than 40,000 first names and
gender, plus some 600 pairs of "equivalent" names.
This list should be able to cover the vast majority of first names
in all European countries and in some overseas countries (e.g. China,
India, Japan, U.S.A.) as well.
Also included in this file is information on the approximate frequency
of each name. The scale goes from 1 (=rare) to 13 (=extremely common).
The value 10 has been formatted to represent at least 2 percent of
the population. (The values 11 to 13 have been added last.)
The scale is logarithmic. For countries with very good statistics,
each step (down to frequency 2) represents a factor of 2.
For example, a frequency value of 7 means that the corresponding first
name has an absolute frequency in the range of 0.25 % to 0.5 %.
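
The scale described above can be turned into a small lookup. The following is only a sketch of the arithmetic, not code from "gender.c": the function name `frequency_range` is made up for illustration, the value 10 is treated as open-ended ("at least 2 percent"), and the values 1 and 11 to 13 are left out because the text does not say how they map to percentages.

```python
def frequency_range(value):
    """Return (lower, upper) population share in percent for a frequency value.

    Assumption: value 10 means "at least 2 percent" and each step down to 2
    halves the share. upper is None for the open-ended value 10.
    """
    if not 2 <= value <= 10:
        raise ValueError("only frequency values 2..10 are covered by this sketch")
    lower = 2.0 * 2.0 ** (value - 10)   # 10 -> 2 %, 9 -> 1 %, 8 -> 0.5 %, ...
    upper = None if value == 10 else lower * 2
    return lower, upper


print(frequency_range(7))   # (0.25, 0.5)
print(frequency_range(10))  # (2.0, None)
```

The result for 7 matches the range given in the example above (0.25 % to 0.5 %).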
The sorting order of the file "nam_dict.txt" is governed by the search
algorithm of the program "gender.c". Hence, names with "expandable"
umlauts can be found twice in this dictionary, first with sorting
according to "expanded" umlauts, and second with sorting according to
"compressed" umlauts (e.g. 'Ö' is sorted like "Oe" and 'O').
You don't have to reformat this file for use in a unix environment,
because the DOS linefeeds (trailing '\r') are ignored when the file
is read.
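
As a rough illustration of the two points above (double sort position for umlauts, and DOS line ends being ignored), here is a minimal sketch. It is not taken from "gender.c". The names `read_names` and `sort_keys` are invented for this example, the set of "expandable" umlauts is an assumption (the text only mentions 'Ö'), and the sample lines are bare names rather than the real layout of "nam_dict.txt".

```python
import io

# Assumption: only the German umlauts are treated as "expandable".
EXPANDED = {"Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ä": "ae", "ö": "oe", "ü": "ue"}
COMPRESSED = {"Ä": "A", "Ö": "O", "Ü": "U", "ä": "a", "ö": "o", "ü": "u"}


def read_names(stream):
    """Yield one decoded entry per line, dropping a trailing CR and/or LF."""
    for raw in stream:
        yield raw.rstrip(b"\r\n").decode("iso8859-1")


def sort_keys(name):
    """Return the (expanded, compressed) spellings a name would be sorted by."""
    expanded = "".join(EXPANDED.get(ch, ch) for ch in name)
    compressed = "".join(COMPRESSED.get(ch, ch) for ch in name)
    return expanded, compressed


# Hypothetical sample data in iso8859-1, with mixed DOS and Unix line ends.
sample = io.BytesIO(b"Anna\r\nJ\xf6rg\r\n\xd6rjan\n")
names = list(read_names(sample))
keys = [sort_keys(n) for n in names]

print(names)  # ['Anna', 'Jörg', 'Örjan']
print(keys)   # [('Anna', 'Anna'), ('Joerg', 'Jorg'), ('Oerjan', 'Orjan')]
```

A name whose two keys differ is the kind of entry that would appear twice in the dictionary, once at each sort position.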
========================================================================
A few words on quality of data
The dictionary of first names has been prepared with utmost care.
For example, the Turkish, Indian and Korean names in this dictionary
have all been independently classified by several native speakers.
I also took special care to list only those names which can currently
be found.
The lesson from this?
Any modifications should be done very cautiously (and they must also
adhere to the sorting required by the search algorithm).
For example, knowing that "Sascha" is a boy's name in Germany, the author
never assumed the English "Sasha" to be a girl's name.
Knowing that "Jan" is a boy's name in Germany, I never assumed it to be
also an English short form of "Janet". Another case in point is the name
"Esra". This is a boy's name in Germany, but a girl's name in Turkey.
Or consider the following first names:
Ildikó female Hungarian name
Mitja male Russian name
Elizaveta rare name; looks like misspelled "Elizabeta"
Roelf rare name; looks like German "Rolf" with an erroneous 'e'
Borchert, Oltmann, Sievert, Hartmann look like common German surnames
|
The tool is released under the LGPL.
I have created a small GUI for the main functions of gender.exe.
The script must be stored in the same directory as gender.exe and nam_dict.txt.
| Code: |
#NoTrayIcon
SetWorkingDir %A_ScriptDir%
;------------auto-execute----------------------------------------------------
IfNotExist, gender.exe
{
MsgBox, gender.exe not found!
ExitApp
}
IfNotExist, nam_dict.txt
{
MsgBox, nam_dict.txt not found!
ExitApp
}
Gui, +Resize
Gui, Margin, 3, 3
Gui, Add, Tab, w284 h260 vMyTab, Get Gender|Check Nickname|List Names|Statistics
Gui, Tab, Get Gender
Gui, Font, S11
Gui, Add, Edit, x8 y30 R1 W270 vMyNameString
Gui, Font, S8
Gui, Add, Button, Default gCheckGender x8 y+5, &Check Gender
Gui, Add, Button, gCheckGenderTrace x+38, Check Gender (Display&Trace)
Gui, Font, S11
Gui, Add, Edit, R9 W270 x8 y+5 vMyResultField ReadOnly
Gui, Font, S8
Gui, Add, Checkbox, x8 y+5 vUseHotkey gActivateHotkey, Use Alt+G to check selected text for gender
Gui, Tab, Check Nickname
Gui, Font, S11
Gui, Add, Text, x8 y33, Name 1:
Gui, Add, Edit, x58 y30 R1 W220 vMyNickAString
Gui, Add, Text, x8 y63, Name 2:
Gui, Add, Edit, x58 y60 R1 W220 vMyNickBString
Gui, Font, S8
Gui, Add, Button, gCheckNick x59 y90, Check, if two first &Names are "equivalent"
Gui, Font, S11
Gui, Add, Edit, R8 W270 x8 y+5 vMyNickResultField ReadOnly
Gui, Tab, List Names
Gui, Add, Text, x8 y33, Country :
Gui, Add, Edit, x60 y30 R1 W218 vMyCountryString
Gui, Font, S8
Gui, Add, Button, gListNames x61 y60, &List all names of the given country.
Gui, Font, S11
Gui, Add, Edit, R10 W270 x8 y+5 vMyCountryResultField ReadOnly
Gui, Tab, Statistics
Gui, Font, S8
Gui, Add, Button, gShowStats x8 y33, &Show statistics
Gui, Font, S11
Gui, Add, Edit, R11 W270 x8 y+7 vMyStatResultField ReadOnly
Gui, Show, , Gender Verification
return
;--------------End-auto-execute----------------------------------------------
;--------------gender.exe related--------------------------------------------
CheckGender:
Gui, Submit, Nohide
Gui +Disabled
Gui, Flash
StringLeft, MyNameString, MyNameString, 100
RunWait, %comspec% /c ""%A_WorkingDir%\gender.exe" "-get_gender" "%MyNameString% " >"RESULT.TXT"", , Hide UseErrorlevel
if ErrorLevel = ERROR
GuiControl, , MyResultField, Calling gender.exe produced an error!
else
{
FileRead, MyResult, Result.txt
GuiControl, , MyResultField, %MyResult%
}
Gui -Disabled
Gui, Flash
Gui, Flash, Off
FileDelete, Result.txt
return
CheckGenderTrace:
Gui, Submit, Nohide
Gui +Disabled
Gui, Flash
StringLeft, MyNameString, MyNameString, 100
RunWait, %comspec% /c ""%A_WorkingDir%\gender.exe" "-get_gender" "%MyNameString% " "-trace" >"RESULT.TXT"", , Hide UseErrorlevel
if ErrorLevel = ERROR
GuiControl, , MyResultField, Calling gender.exe produced an error!
else
{
FileRead, MyResult, Result.txt
GuiControl, , MyResultField, %MyResult%
}
Gui -Disabled
Gui, Flash
Gui, Flash, Off
FileDelete, Result.txt
return
CheckSelectedforGender:
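ClipSaved := ClipboardAll
Clipboard = ; Empty the clipboard first, otherwise ClipWait returns at once on the old content.
Send ^c
ClipWait, 4
if ErrorLevel
{
GuiControl, , MyResultField, The attempt to copy text onto the clipboard failed.
Clipboard := ClipSaved ; Restore the user's clipboard on failure as well.
ClipSaved =
return
}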
Loop, parse, Clipboard, `n, `r ; Specifying `n prior to `r allows both Windows and Unix files to be parsed.
{
MyNameString := A_LoopField
break
}
StringLeft, MyNameString, MyNameString, 100
Gui +Disabled
Gui, Flash
RunWait, %comspec% /c ""%A_WorkingDir%\gender.exe" "-get_gender" "%MyNameString% " >"RESULT.TXT"", , Hide UseErrorlevel
if ErrorLevel = ERROR
GuiControl, , MyResultField, Calling gender.exe produced an error!
else
{
FileRead, MyResult, Result.txt
GuiControl, , MyNameString, %MyNameString%
GuiControl, , MyResultField, %MyResult%
GuiControl, , MyNickAString, %MyNameString%
ToolTip, %MyResult%
SetTimer, RemoveToolTip, 5000
}
Gui -Disabled
Gui, Flash
Gui, Flash, Off
FileDelete, Result.txt
Clipboard := ClipSaved
ClipSaved =
return
RemoveToolTip:
SetTimer, RemoveToolTip, Off
ToolTip
return
CheckNick:
Gui +Disabled
Gui, Flash
Gui, Submit, Nohide
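; Truncate each input in place; the original wrote both results into MyNameString and discarded them.
StringLeft, MyNickAString, MyNickAString, 100
StringLeft, MyNickBString, MyNickBString, 100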
RunWait, %comspec% /c ""%A_WorkingDir%\gender.exe" "-check_nickname" "%MyNickAString% " "%MyNickBString% " >"RESULT.TXT"", , Hide UseErrorlevel
if ErrorLevel = ERROR
GuiControl, , MyNickResultField, Calling gender.exe produced an error!
else
{
FileRead, MyResult, Result.txt
GuiControl, , MyNickResultField, %MyResult%
}
Gui -Disabled
Gui, Flash
Gui, Flash, Off
FileDelete, Result.txt
return
ListNames:
Gui, Submit, Nohide
Gui +Disabled
Gui, Flash
StringLeft, MyCountryString, MyCountryString, 100 ; Truncate the variable that is actually passed to gender.exe.
RunWait, %comspec% /c ""%A_WorkingDir%\gender.exe" "-print_names_of_country" "%MyCountryString%" "RESULT.TXT"", , Hide UseErrorlevel
if ErrorLevel = ERROR
GuiControl, , MyCountryResultField, Calling gender.exe produced an error!
else
{
FileRead, MyResult, Result.txt
GuiControl, , MyCountryResultField, %MyResult%
}
Gui -Disabled
Gui, Flash
Gui, Flash, Off
FileDelete, Result.txt
return
ShowStats:
Gui +Disabled
Gui, Flash
RunWait, %comspec% /c ""%A_WorkingDir%\gender.exe" "-statistics" >"RESULT.TXT"", , Hide UseErrorlevel
if ErrorLevel = ERROR
GuiControl, , MyStatResultField, Calling gender.exe produced an error!
else
{
FileRead, MyResult, Result.txt
GuiControl, , MyStatResultField, %MyResult%
}
Gui -Disabled
Gui, Flash
Gui, Flash, Off
FileDelete, Result.txt
return
;---------------End-gender.exe related----------------------------------------
;---------------Hotkey related------------------------------------------------
ActivateHotkey:
Gui, Submit, Nohide
if (UseHotkey = 1)
{
Hotkey, !g, CheckSelectedforGender, ON
}
if (UseHotkey = 0)
{
Hotkey, !g, CheckSelectedforGender, OFF
}
return
;---------------End-Hotkey related--------------------------------------------
;---------------Gui related---------------------------------------------------
GuiSize:
if (A_EventInfo != 1)
{
if (A_GuiWidth < 290)
Gui, Show, w290
if (A_GuiHeight < 260)
Gui, Show, h260
}
GuiControl, Move, MyTab, % "w" A_GuiWidth-6 "h" A_GuiHeight-6
GuiControl, Move, MyResultField, % "w" A_GuiWidth-15 "h" A_GuiHeight-115
GuiControl, Move, UseHotkey, % "y" A_GuiHeight-25
GuiControl, Move, MyNickResultField, % "w" A_GuiWidth-15 "h" A_GuiHeight-130
GuiControl, Move, MyCountryResultField, % "w" A_GuiWidth-15 "h" A_GuiHeight-100
GuiControl, Move, MyStatResultField, % "w" A_GuiWidth-15 "h" A_GuiHeight-75
return
GuiClose:
ExitApp
;---------------End-Gui related-----------------------------------------------
|
The exe and the ahk file can be downloaded here: http://www.autohotkey.net/~Zed_Gecko/gender/WinGender2.zip
Last edited by Zed Gecko on Fri Jan 09, 2009 5:38 pm; edited 2 times in total