Language Resources

We gratefully acknowledge support from the National Science Foundation for helping fund this project.

Overview

PsyCoL Hebrew Lexical Corpus (PHLC)

Sources

The PsyCoL Hebrew Lexical Corpus (PHLC) is composed of text from on-line newspapers and forums, obtained from two external research sources:

MILA Knowledge Center for Processing Hebrew

2b-bari, Arutz7, Haaretz, Infomed Medical Forum, Tapuz People Forums, TheMarker. All corpora remain copyright of the respective internet sources listed here.

David Plaut

Haaretz, Maariv, Ynet. All corpora remain copyright of the respective internet sources listed here.

All data was converted to UTF-8, tokenized, and stored in a relational database (MySQL v. 5.0). Each data source below is represented as a single column in the table, with one value for each unique token in the full set found across all sources aggregated here. In addition, there is a column for the total token count for each unique token.
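The table layout described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than MySQL, the table name `tokens`, the column name `total`, and the two tiny placeholder "sources" are hypothetical, and it assumes that the value stored in each source column is that token's count within the source.

```python
import sqlite3
from collections import Counter

# Placeholder data (hypothetical): transliterated stand-ins for tokenized text.
sources = {
    "haaretz": "shalom olam shalom".split(),
    "arutz7": "olam hadash".split(),
}

def build_table(sources):
    """Build one row per unique token, one column per source, plus a total."""
    conn = sqlite3.connect(":memory:")
    cols = sorted(sources)  # source names are assumed to be safe identifiers
    col_defs = ", ".join(f'"{c}" INTEGER NOT NULL DEFAULT 0' for c in cols)
    conn.execute(
        f"CREATE TABLE tokens (token TEXT PRIMARY KEY, {col_defs}, "
        "total INTEGER NOT NULL)"
    )
    counts = {name: Counter(toks) for name, toks in sources.items()}
    # Full set of unique tokens across all sources.
    vocab = sorted(set().union(*counts.values()))
    placeholders = ", ".join("?" * (len(cols) + 2))
    for tok in vocab:
        per_source = [counts[c][tok] for c in cols]  # 0 if token is absent
        conn.execute(
            f"INSERT INTO tokens VALUES ({placeholders})",
            [tok, *per_source, sum(per_source)],
        )
    conn.commit()
    return conn

conn = build_table(sources)
for row in conn.execute("SELECT * FROM tokens ORDER BY token"):
    print(row)
```

With a layout like this, a token's frequency in a single source and its frequency across the whole corpus can be read from the same row.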

- MILA Knowledge Center Sources -

2b-bari

Date Range: ?

Type: Articles/ Forums

Tokens: 709,024

Source Site

Arutz7 Newswires

Date Range: 2001 - 2006

Type: Articles/ Forums

Tokens: 15,107,618

Source Site

Doctors

Date Range: ?

Type: Medical

Tokens: 196,603

Source Site

Haaretz

Date Range: 1991

Type: News/ Articles

Tokens: 8,273,572

Source Site

Infomed Medical Forum

Date Range: January 2006 - September 2007

Type: Medical/ Forum

Tokens: 163,649

Source Site

Tapuz People Forums

Date Range:

Type: Forum

Tokens: 1,004,998

Source Site

TheMarker

Date Range: 2002

Type: Financial Articles

Tokens: 559,438

Source Site

- David Plaut Sources -

Haaretz

Date Range: June 2000 - December 2001

Type: ?

Tokens: ?

Maariv

Date Range: 2000 - 2001

Type:

Tokens: ?

Ynet

Date Range: ?

Type: ?

Tokens: ?

PsyCoL Maltese Lexical Corpus (PMLC)

Sources

The PsyCoL Maltese Lexical Corpus (PMLC) is composed of text from on-line newspapers within two general date ranges: 1) 1998 - 1999 and 2) 2005 - 2007.

All data was retrieved from the web, but the two date ranges cited here reflect two distinct collection efforts. The first was carried out by Albert Gatt, who has graciously shared data he collected from several sources (Kulħadd, Leħen is-Sewwa, Il-Mument and In-Nazzjon). This work represents 1,395,727 tokens and 53,396 unique types. The second effort was conducted by the PsyCoL lab and includes all other data collected (Illum, Malta Right Now), which adds 1,927,598+ tokens to the corpus.

- PsyCoL Sources -

Illum

Date Range: November 12, 2006 - September 30, 2007

Type: News Article/ Opinion

Tokens:

Malta Right Now

Date Range: January 12, 2005 - October 9, 2007

Type: News Article/ Opinion

Tokens: 1,927,598

- Albert Gatt Sources -

Kulħadd

Date Range: April 4, 1998 - May 23, 1999

Type: News Article/ Opinion

Tokens: 69,908

Leħen is-Sewwa

Date Range: February 1999 - June 1999

Type: Church-affiliated organization (Catholic)

Tokens: 23,914

Il-Mument

Date Range: July 1998 - August 1999

Type: News Article/ Opinion

Tokens: 60,982

In-Nazzjon

Date Range: July 1998 - August 1999

Type: News Article/ Opinion

Tokens: 1,240,923
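As a quick consistency check, the per-source token counts listed above for the Albert Gatt sources can be summed and compared with the 1,395,727 tokens reported for that effort. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# Token counts as listed for each Albert Gatt source.
gatt_tokens = {
    "Kulħadd": 69_908,
    "Leħen is-Sewwa": 23_914,
    "Il-Mument": 60_982,
    "In-Nazzjon": 1_240_923,
}

gatt_total = sum(gatt_tokens.values())
print(gatt_total)  # 1395727, matching the reported total
```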
