Overview
PsyCoL Hebrew Lexical Corpus (PHLC)
Sources
The PsyCoL Hebrew Lexical Corpus (PHLC) is composed of on-line newspapers, forums and medical sites obtained from two external research entities:
MILA Knowledge Center for Processing Hebrew
2b-bari, Arutz7, Haaretz, Infomed Medical Forum, Tapuz People Forums, TheMarker. All corpora remain copyright of the respective internet sources listed here.
David Plaut
Haaretz, Maariv, Ynet. All corpora remain copyright of the respective internet sources listed here.
All data was converted to UTF-8, tokenized, and stored in a relational database (MySQL v. 5.0). Each data source below is represented as a single column in the table, with values that correspond to the full set of unique tokens found across all sources aggregated here. An additional column holds the total token count for each unique token.
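The layout described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server, it assumes that the value in each source column is that token's frequency in that source, and the table name, column names and sample tokens are all hypothetical placeholders rather than the actual PHLC schema or data.

```python
import sqlite3
from collections import Counter

# Hypothetical placeholder token lists, one per source (not corpus data).
sources = {
    "haaretz": ["שלום", "עולם", "שלום"],
    "themarker": ["עולם", "שוק"],
}

def build_token_table(sources):
    """Build one row per unique token, one column per source, plus a total."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL
    source_cols = ", ".join(
        f'"{name}" INTEGER NOT NULL DEFAULT 0' for name in sources
    )
    conn.execute(
        f"CREATE TABLE tokens (token TEXT PRIMARY KEY, {source_cols}, "
        "total INTEGER NOT NULL DEFAULT 0)"
    )
    # Assumed meaning of the column values: per-source token frequency.
    counts = {name: Counter(toks) for name, toks in sources.items()}
    vocabulary = sorted(set().union(*counts.values()))
    placeholders = ", ".join("?" for _ in range(len(sources) + 2))
    for token in vocabulary:
        per_source = [counts[name][token] for name in sources]
        conn.execute(
            f"INSERT INTO tokens VALUES ({placeholders})",
            [token, *per_source, sum(per_source)],
        )
    return conn

conn = build_token_table(sources)
row = conn.execute(
    "SELECT haaretz, themarker, total FROM tokens WHERE token = ?", ("שלום",)
).fetchone()
print(row)  # (2, 0, 2)
```

A wide table of this kind keeps every source's count for a token on a single row, so a per-token comparison across sources needs no joins. The cost is that adding a new source means adding a new column.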
- MILA Knowledge Center Sources -
2b-bari
Date Range: ?
Type: Articles/ Forums
Tokens: 709,024
Source Site
Arutz7 Newswires
Date Range: 2001 - 2006
Type: Articles/ Forums
Tokens: 15,107,618
Source Site
Doctors
Date Range: ?
Type: Medical
Tokens: 196,603
Source Site
Haaretz
Date Range: 1991
Type: News/ Articles
Tokens: 8,273,572
Source Site
Infomed Medical Forum
Date Range: January 2006 - September 2007
Type: Medical/ Forum
Tokens: 163,649
Source Site
Tapuz People Forums
Date Range: ?
Type: Forum
Tokens: 1,004,998
Source Site
TheMarker
Date Range: 2002
Type: Financial Articles
Tokens: 559,438
Source Site
- David Plaut Sources -
Haaretz
Date Range: June 2000 - December 2001
Type: ?
Tokens: ?
Maariv
Date Range: 2000 - 2001
Type: ?
Tokens: ?
Ynet
Date Range: ?
Type: ?
Tokens: ?
PsyCoL Maltese Lexical Corpus (PMLC)
Sources
The PsyCoL Maltese Lexical Corpus (PMLC) is composed of on-line newspapers within two general date ranges: 1) 1998 - 1999 and 2) 2005 - 2007.
All data was retrieved from the web, but the two date ranges cited here reflect two distinct efforts. The first was by Albert Gatt, who graciously shared data he collected from various sources (Kulħadd, Leħen, Il-Mument and In-Nazzjon). This work represents 1,395,727 tokens and 53,396 unique types. The second effort was conducted by the PsyCoL lab and includes all other data collected (Illum, Malta Right Now), which adds 1,927,598+ tokens to the corpus.
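To make the distinction between tokens and unique types concrete, here is a minimal sketch. It assumes simple whitespace tokenization, which may differ from the tokenizer actually used for the corpus, and the sample sentence is a made-up placeholder. The closing arithmetic uses only the figures reported above and treats the PsyCoL figure as a minimum.

```python
from collections import Counter

def token_type_counts(tokens):
    """Return (number of tokens, number of unique types)."""
    frequencies = Counter(tokens)
    return sum(frequencies.values()), len(frequencies)

# Hypothetical sample; whitespace splitting is an assumption.
sample = "il-kelb u l-qattus u il-kelb".split()
print(token_type_counts(sample))  # (5, 3)

# Figures reported in the text above.
gatt_tokens = 1_395_727
gatt_types = 53_396
psycol_tokens = 1_927_598  # treated here as a lower bound

combined_tokens = gatt_tokens + psycol_tokens
type_token_ratio = gatt_types / gatt_tokens
print(combined_tokens)  # 3323325
print(round(type_token_ratio, 4))
```

The combined figure is only a lower bound on the corpus size if the PsyCoL count has grown since it was reported.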
All data was converted to UTF-8, tokenized, and stored in a relational database (MySQL v. 5.0). Each data source below is represented as a single column in the table, with values that correspond to the full set of unique tokens found across all sources aggregated here. An additional column holds the total token count for each unique token.