Linguistics Stimuli
The following is forked with permission from (and almost identical to) The Language Goldmine.
URL | Title | Description | Tags | Languages | Associated Publication |
---|---|---|---|---|---|
http://concepticon.clld.org/ | Concepticon | Links 9611 concepts from 51 different concept lists to 2206 different concept sets, 243 relations between concepts are defined | semantics, concepts, lexicon structure, vocabulary | multilingual | |
http://clics.lingpy.org/ | Database of Cross-Linguistic Colexifications | Gives polysemy information for 221 different languages covering 64 families (more than 300000 words and 10000 concepts) | semantics, concepts, polysemy, lexicon structure, vocabulary, typology | multilingual | List, J.-M., Terhalle, A., & Urban, M. (2013). Using network approaches to enhance the analysis of cross-linguistic polysemies. Proceedings of the 10th International Conference on Computational Semantics (pp. 347-353). Association for Computational Linguistics. |
https://archive.org/details/tv | The TV News Archive | Contains more than 705,000 captioned and searchable news programs from over 4 years of U.S. television networks | semantics, gesture, phonetics, corpus, TV, media, politics, news, multimodal corpus | English, Spanish | |
http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nl | SUBTLEX-NL | Dutch word frequencies based on 44 million words from film and television subtitles | word frequency, contextual diversity | Dutch | Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42(3), 643-650. |
http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-ch | SUBTLEX-CH | Chinese word frequencies based on 33.5 million words from film and television subtitles | word frequency, part of speech (POS), lexical decision task, reaction times (RT), response latency | Chinese | Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One, 5(6), e10729. |
http://www.bcbl.eu/databases/subtlex-gr/ | SUBTLEX-GR | Modern Greek word frequencies based on 23 million words from film and television subtitles | word frequency, orthographic neighborhood density, orthographic Levenshtein distance, contextual diversity | Greek | Dimitropoulou, M., Duñabeitia, J., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-based word frequencies as the best estimate of reading behaviour: the case of Greek. Frontiers in Psychology, 1:218, 1-12. |
http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-pl | SUBTLEX-PL | Polish word frequencies based on 101 million words from film and television subtitles | word frequency | Polish | Mandera, P., Keuleers, E., Wodniecka, Z., & Brysbaert, M. (2014). Subtlex-pl: subtitle-based word frequency estimates for Polish. Behavior research methods, 47(2), 471-483. |
http://crr.ugent.be/archives/1423 | SUBTLEX-UK | British English word frequencies based on 201.3 million words from 45,099 BBC broadcasts | word frequency, contextual diversity, word frequency in children's programs, part of speech (POS), bigram frequencies | English | Van Heuven, W.J.B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). Subtlex-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190. |
http://crr.ugent.be/archives/534 | SUBTLEX-DE | German word frequencies based on 25.4 million words from film and television subtitles | word frequency | German | Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424. |
http://www.dlexdb.de/ | Digitales Woerterbuch der deutschen Sprache (dlexDB) | Over 100 million German word tokens, neighborhood densities and bigram and trigram probabilities based on different registers | word frequency, bigram probability, trigram probability, neighborhood density, conditional probability | German | Heister, J., Wuerzner, K. M., Bubenzer, J., Pohl, E., Hanneforth, T., Geyken, A., & Kliegl, R. (2011). dlexDB-A lexical database for the psychological and linguistic research. Psychologische Rundschau, 62(1), 10-20. |
http://wortschatz.uni-leipzig.de/ | Leipzig Wortschatz Lexicon | German thesaurus and lexical network | thesaurus, lexical network | German | |
http://corpora2.informatik.uni-leipzig.de/ | Leipzig Corpus Collection | Contains frequencies and co-occurrence information for 219 languages | word frequency, corpus | multilingual | Quasthoff, U., Richter, M., Biemann, C. (2006). Corpus Portal for Search in Monolingual Corpora. Proceedings of the fifth international conference on Language Resources and Evaluation, LREC 2006, Genoa, pp. 1799-1802. |
http://crr.ugent.be/archives/806 | Kuperman English Age-of-acquisition ratings | Age-of-acquisition ratings for 30,000 English words. | age of acquisition (AOA) | English | Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978-990. |
http://crr.ugent.be/archives/1003 | Warriner English Affective Ratings | Valence, arousal and dominance ratings for 13,915 English words | emotion, valence, dominance, arousal, affect, positive, negative | English | Warriner, A.B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45, 1191-1207. |
http://crr.ugent.be/archives/1330 | Brysbaert English Concreteness Ratings | Concreteness ratings for 40,000 English words | concreteness | English | Brysbaert, M., Warriner, A.B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904-911. |
http://crr.ugent.be/archives/1602 | Brysbaert Dutch Age-of-acquisition & Concreteness ratings | Age-of-acquisition and concreteness ratings for 30,000 Dutch words | concreteness, age of acquisition (AOA) | Dutch | Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80-84. |
http://crr.ugent.be/archives/878 | Moors Dutch Affective Ratings | Valence, arousal and dominance ratings for 4,300 Dutch words | emotion, valence, dominance, arousal, affect, positive, negative | Dutch | Moors, A., De Houwer, J., Hermans, D., Wanmaker, S., van Schie, K., Van Harmelen, A. L., De Schryver, M., De Winne, J., & Brysbaert, M. (2013). Norms of valence, arousal, dominance, and age of acquisition for 4,300 Dutch words. Behavior research methods, 45(1), 169-177. |
https://sites.google.com/site/frenchlexicon/results | French Lexicon Project | Lexical decision data for 38,840 French words and 38,840 pseudowords | lexicon project, psycholinguistic database, reaction times (RT), response latency, word frequency | French | Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., Augustinova, M., & Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42, 488-496. |
http://www.lexique.org/ | Lexique | French lexical database for 135,000 words | lexicon project, psycholinguistic database, reaction times (RT), response latency, word frequency | French | |
http://link.springer.com/article/10.3758%2FBRM.42.4.992 | Malay Lexicon Project | Malay lexical database for 9,592 words | lexicon project, psycholinguistic database, reaction times (RT), response latency, word frequency | Malay | Yap, M. J., Liow, S. J. R., Jalil, S. B., & Faizal, S. S. B. (2010). The Malay Lexicon Project: A database of lexical statistics for 9,592 words. Behavior research methods, 42(4), 992-1003. |
http://elexicon.wustl.edu/ | English Lexicon Project (ELP) | English lexical database for 40,481 words | lexicon project, psycholinguistic database, reaction times (RT), response latency, lexical decision task, word naming, contextual diversity, neighborhood density, bigram probability, part of speech (POS), levenshtein distance, SUBTLEX | English | Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English lexicon project. Behavior research methods, 39(3), 445-459. |
http://journal.frontiersin.org/article/10.3389/fpsyg.2010.00174/abstract | Dutch Lexicon Project | Dutch lexical database for 14,000 words | lexicon project, psycholinguistic database, reaction times (RT), response latency, word frequency | Dutch | Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono-and disyllabic words and nonwords. Frontiers in Psychology, 1, 174. |
http://crr.ugent.be/programs-data/lexicon-projects | British Lexicon Project (BLP) | British English lexical database for 28,000 words | lexicon project, psycholinguistic database, reaction times (RT), response latency, bigram probability, trigram probability, word frequency | English | Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287-304. |
http://crr.ugent.be/programs-data/word-prevalence-values | Dutch Word Knowledge & Prevalence | Word prevalence values for 54,319 Dutch words from nearly 300,000 participants | word prevalence, word knowledge, lexical knowledge | Dutch | Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. The Quarterly Journal of Experimental Psychology, (ahead-of-print), 1-28. |
http://semshifts.iling-ran.ru/ | Database of Semantic Shifts in the Languages of the World | 3,690 semantic connections in the world's languages (polysemy, semantic changes) | polysemy, semantic change, semantics | multilingual | Zalizniak, A. A., Bulakh, M., Ganenkov, D., Gruntov, I., Maisak, T., & Russo, M. (2012). The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics, 50, 633-669. |
http://w3.usf.edu/FreeAssociation/ | USF Word Association Norms | Free word association data for 72,000 word pairs | word association, semantics | English | Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36, 402-407. |
http://link.springer.com/article/10.3758/BRM.41.2.558 | Modality Exclusivity Norms for Adjectives | How much 423 adjectives are associated with different modalities/senses (e.g., vision, hearing) | modality exclusivity, senses, semantics, adjectives | English | Lynott, D., & Connell, L. (2009). Modality exclusivity norms for 423 object properties. Behavior Research Methods, 41, 558-564. |
http://www.psych.rl.ac.uk/ | MRC Psycholinguistics Database | Lexical database for English | lexicon project, psycholinguistic database, reaction times (RT), response latency, word frequency, concreteness, familiarity, imageability, meaningfulness, part of speech (POS), lexical category, part of speech, number of phonemes, syllables, letters, stress-marked phonetic transcription | English | Coltheart, M. (1981). The MRC Psycholinguistic Database. Quarterly Journal of Experimental Psychology, 33A, 497-505.; Wilson, M.D. (1988). The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20, 6-11. |
http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/overview.htm | SUBTLEX-US | Frequencies from 51 million word tokens | word frequency, contextual diversity | English | Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.; Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior research methods, 44(4), 991-997. |
http://www.natcorp.ox.ac.uk/ | British National Corpus (BNC) | Corpus based with 100 million words | corpus | English | |
http://wals.info/ | World Atlas of Language Structures (WALS) | Typological database | typology, grammatical database, syntax, morphology, phonology | multilingual | Dryer, Matthew S. & Haspelmath, Martin (eds.) (2013). The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. |
http://wold.clld.org/ | The World Loanword Database (WOLD) | Loanword database with mini-dictionaries for 41 languages; words are coded for likelihood of being a loanword | loanwords, borrowing, language contact | multilingual | Haspelmath, M., & Tadmor, U. (Eds.). (2009). Loanwords in the world's languages: a comparative handbook. Walter de Gruyter. |
https://sites.google.com/site/kenmcraelab/norms-data | Semantic Feature Norms | Semantic feature norms for 541 concepts from 725 participants | semantic features, semantics, feature norms, distinctive features, objects and events, properties | English | McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior research methods, 37(4), 547-559. |
https://books.google.com/ngrams | Google Ngram | Large corpora of books with word frequencies and ngram frequencies from English, German, French, Italian, Spanish, Russian, Chinese and Hebrew, POS-tagged | ngrams, word frequency | multilingual | Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., ... & Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized books. science, 331(6014), 176-182. |
http://wordnet.princeton.edu/ | WordNet | Machine-readable dictionary with semantic relations for English | wordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchies | English | Miller GA. WordNet: a lexical database for English. Communications of the ACM. 1995;38:39-41.; Fellbaum C. WordNet: An Electronic Lexical Database. MIT Press; 1998. |
http://www.eat.rl.ac.uk/ | Edinburgh Associative Thesaurus | English word association norms | word association, semantics | English | Kiss, G.R., Armstrong, C., Milroy, R., and Piper, J. (1973) An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W. and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: University Press. |
http://www.illc.uva.nl/EuroWordNet/ | Euro WordNet | Wordnets for several European languages | wordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchies | multilingual | |
http://corpus.byu.edu/coca/ | Corpus of Contemporary American English (COCA) | 440 million word corpus of contemporary American English | corpus | English | |
http://corpus.byu.edu/coha/ | Corpus of Historical American English (COHA) | 385 million word corpus of historical American English | corpus | English | Davies, M. (2011). The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English. Literary and Linguistic Computing 25: 447-65. |
http://lingweb.eva.mpg.de/ids/ | Intercontinental Dictionary Series (IDS) | Comparative lexical database | lexicon, dictionary, vocabulary | multilingual | |
http://phoible.org/ | PHOIBLE: The world's largest database of phonological inventories | Cross-linguistic phoneme inventory data | typology, phonemes, phonetics, phonology, phoneme inventory | multilingual | Moran, S., & McCloy, D., & Wright, R. (eds.) 2014. PHOIBLE Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. |
http://137.122.133.199/~Jeff/pbase/index.html | P-base | Database of several thousand sound patterns in 500+ languages | phonetics, phonology, typology | multilingual | |
http://137.122.133.199/~Jeff/phonetic_similarity/index.html | Phonetic Similarity database | Similarity ratings for 51 segments | phonetics, phonology | multilingual | |
http://www.autotyp.uzh.ch/ | AUTOTYP | Typological database | language family, language area, genealogy, geography, genealogical data, typology, language history | multilingual | Bickel, B. & J. Nichols, 2002. Autotypologizing databases and their use in fieldwork. In Proc. Int. LREC Workshop on Resources and Tools in Field Linguistics. Las Palmas, 25-26 May 2002. |
http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus | KAIST Korean Corpus | 70 million Korean eojeol corpus | corpus | Korean | |
http://www.statmt.org/europarl/ | Europarl Parallel Corpus (EPC) | Parallel text with up to 60 million words for 20 languages | parallel corpus, corpus, translation | multilingual | |
http://paralleltext.info/data/ | Parallel Bible Corpus (PBC) | Parallel corpus of the bible with around 900 languages from around 80 different language families | parallel corpus, corpus, translation | multilingual | |
http://unicode.org/udhr/ | Universal Declaration of Human Rights | Parallel corpus of the declaration of human rights for 400 languages | parallel corpus, corpus, translation | multilingual | |
http://wacky.sslmit.unibo.it/doku.php?id=corpora | Wacky Corpus | Syntactically annotated or POS-tagged corpora with up to 2 billion words for English, French, German and Italian, also includes Italian Wikipedia corpus | corpus, syntax, treebank, part of speech (POS), Wikipedia | English, French, German, Italian | |
http://asjp.clld.org/ | Automated Similarity Judgment Project (ASJP) | Word lists of around 6000 languages | word list, typology, vocabulary | multilingual | Wichmann, Søren, André Müller, Annkathrin Wett, Viveka Velupillai, Julia Bischoffberger, Cecil H. Brown, Eric W. Holman, Sebastian Sauppe, Zarina Molochieva, Pamela Brown, Harald Hammarström, Oleg Belyaev, Johann-Mattis List, Dik Bakker, Dmitry Egorov, Matthias Urban, Robert Mailhammer, Agustina Carrizo, Matthew S. Dryer, Evgenia Korovina, David Beck, Helen Geyer, Patience Epps, Anthony Grant, and Pilar Valenzuela. 2013. The ASJP Database. |
https://www.ethnologue.com/ | Ethnologue | Comprehensive catalogue of the world's languages | catalogue, typology | multilingual | |
http://glottolog.org/ | Glottolog | Comprehensive reference information for the world's languages | catalogue, typology | multilingual | |
http://opus.lingfil.uu.se/ | Open parallel corpus | A collection of parallel corpora including 71 million sentences for about 30 languages | parallel corpus, corpus, translation | multilingual | |
http://archive.org/browse.php?field=subject&mediatype=texts&collection=rosettaproject | Rosetta Collection in The Internet Archive | Media files and documents about the languages from the world collected by the Rosetta foundation | reference grammar, word list | multilingual | |
http://data.worldbank.org/ | World Bank Open Data | Demographic and geographical data on the world's countries | demographic data, country, migration, bilingualism, language use | non-linguistic | |
http://link.springer.com/article/10.3758/s13428-012-0267-0 | Modality Exclusivity Norms for Nouns | Modality norms for 400 English nouns | perceptual attributes, modality exclusivity, senses, semantics, vision, sight, hearing, touch, taste, smell | English | Lynott, D., & Connell, L. (2013). Modality exclusivity norms for 400 nouns: The relationship between perceptual experience and surface word form. Behavior research methods, 45(2), 516-526. |
http://link.springer.com/article/10.3758/s13428-010-0038-8 | Modality Exclusivity Norms for Adjectives | Modality norms from 400 American English participants for 387 adjectives | perceptual attributes, modality exclusivity, senses, semantics, vision, sight, hearing, touch, taste, smell | English | van Dantzig, S., Cowell, R. A., Zeelenberg, R., & Pecher, D. (2011). A sharp image or a sharp knife: Norms for the modality-exclusivity of 774 concept-property items. Behavior Research Methods, 43(1), 145-154. |
http://link.springer.com/article/10.3758/s13428-012-0215-z | Perceptual and motor attribute ratings | Perceptual and motor attribute ratings for 559 concepts based on 376 American English participants | graspability, perceptual attributes, semantics | English | Amsel, B. D., Urbach, T. P., & Kutas, M. (2012). Perceptual and motor attribute ratings for 559 object concepts. Behavior research methods, 44(4), 1028-1041. |
http://link.springer.com/article/10.3758/s13428-012-0242-9 | Sensory experience ratings | Sensory experience ratings for 5857 English words based on 63 participants | perceptual attributes, semantics | English | Juhasz, B. J., & Yap, M. J. (2013). Sensory experience ratings for over 5,000 mono-and disyllabic words. Behavior research methods, 45(1), 160-168. |
http://link.springer.com/article/10.3758/s13428-014-0488-5 | Manipulability and naming norms for photographs | Manipulability ratings and naming RT norms for photographs | perceptual attributes, manipulability | non-linguistic | |
http://link.springer.com/article/10.3758/BRM.42.1.82 | Manipulability, familiarity and AOA for photographs | Manipulability, familiarity and AOA for photographs | perceptual attributes, manipulability, age of acquisition (AOA), familiarity | non-linguistic | Salmon, J. P., McMullen, P. A., & Filliter, J. H. (2010). Norms for two types of manipulability (graspability and functional usage), familiarity, and age of acquisition for 320 photographs of objects. Behavior Research Methods, 42(1), 82-95. |
http://www.tandfonline.com/doi/abs/10.1080/13825585.2010.540849 | Spanish norms for photographs | 140 color images that have been normed by 106 Spanish speakers on age of acquisition, familiarity, manipulability and other measures | age of acquisition (AOA), perceptual attributes, manipulability | Spanish | Moreno-Martínez, F. J., Montoro, P. R., & Laws, K. R. (2011). A set of high quality colour images with Spanish norms for seven relevant psycholinguistic variables: The Nombela naming test. Aging, Neuropsychology, and Cognition, 18(3), 293-327. |
http://link.springer.com/article/10.3758/s13428-014-0466-y | French acronym norms | Psycholinguistic norms for French acronyms | acronyms, reading time (RT), age of acquisition ratings (AOA), subjective frequency, imageability | French | Bonin, P., Méot, A., Millotte, S., & Bugaiska, A. (2014). Norms and reading times for acronyms in French. Behavior research methods, 47(1), 251-267. |
http://link.springer.com/article/10.3758/s13428-014-0454-2 | Spanish AOA norms | Subjective age-of-acquisition norms for 7,039 Spanish words | age of acquisition (AOA) | Spanish | Alonso, M. A., Fernandez, A., & Díez, E. (2015). Subjective age-of-acquisition norms for 7,039 Spanish words. Behavior research methods, 47(1), 268-274. |
http://link.springer.com/article/10.3758/s13428-014-0467-x | Persian emotional speech | Emotional speech from 470 sentences normed by 1,126 Persian native speakers | emotion, emotional speech | Persian | Keshtiari, N., Kuhlmann, M., Eslami, M., & Klann-Delius, G. (2015). Recognizing emotional speech in Persian: A validated database of Persian emotional speech (Persian ESD). Behavior research methods, 47(1), 275-294. |
http://www.csl.psychol.cam.ac.uk/propertynorms/ | CSLB Property Norms | Feature norms for 866 concrete concepts by 123 native speakers of British English | semantic features, semantics | English | Devereux, B. J., Tyler, L. K., Geertzen, J., & Randall, B. (2014). The Centre for Speech, Language and the Brain (CSLB) concept property norms. Behavior research methods, 46(4), 1119-1127. |
http://link.springer.com/article/10.3758/s13428-013-0426-y | ANGST German affectiveness ratings | Valence, arousal, dominance and other ratings for 1,003 German words | emotion, valence, dominance, arousal, affect, positive, negative | German | Schmidtke, D. S., Schröder, T., Jacobs, A. M., & Conrad, M. (2014). ANGST: Affective norms for German sentiment terms, derived from the affective norms for English words. Behavior Research Methods, 46(4), 1108-1118. |
http://link.springer.com/article/10.3758/s13428-013-0431-1 | French affective norms | Affective norms for 1,031 French words by 469 French speakers | emotion, valence, dominance, arousal, affect, positive, negative | French | Monnier, C., & Syssau, A. (2014). Affective norms for French words (FAN). Behavior research methods, 46(4), 1128-1137. |
http://link.springer.com/article/10.3758/s13428-013-0409-z | Gender stereotypicality norms | Gender stereotypicality norms for role nouns in 7 European languages | gender, stereotypes, stereotypicality, sociolinguistics | Czech, English, French, German, Italian, Norwegian, Slovak | Misersky, J., Gygax, P. M., Canal, P., Gabriel, U., Garnham, A., Braun, F., ... & Sczesny, S. (2014). Norms on the gender perception of role nouns in Czech, English, French, German, Italian, Norwegian, and Slovak. Behavior research methods, 46(3), 841-871. |
http://link.springer.com/article/10.3758/s13428-013-0400-8 | PhonItalia | Phonological representations for 120,000 Italian word forms | phonological representation, phonology, transcription | Italian | Goslin, J., Galluzzi, C., & Romani, C. (2014). PhonItalia: a phonological lexicon for Italian. Behavior research methods, 46(3), 872-886. |
http://link.springer.com/article/10.3758/s13428-013-0405-3 | Italian affective norms | Affective norms for 1,121 Italian words | emotion, valence, dominance, arousal, affect, positive, negative | Italian | Montefinese, M., Ambrosini, E., Fairfield, B., & Mammarella, N. (2014). The adaptation of the affective norms for english words (ANEW) for Italian. Behavior research methods, 46(3), 887-903. |
http://link.springer.com/article/10.3758/s13428-013-0370-x | Subjective ASL frequency | Subjective frequency ratings for 432 ASL signs from 59 native deaf signers | subjective frequency, familiarity, American Sign Language (ASL) | ASL | Mayberry, R. I., Hall, M. L., & Zvaigzne, M. (2014). Subjective frequency ratings for 432 ASL signs. Behavior research methods, 46(2), 526-539. |
http://link.springer.com/article/10.3758/s13428-013-0389-z | Japanese-English similarity for translation equivalents | 193 Japanese-English word pairs are rated for phonological and semantic similarity | phonological similarity, phonetics, phonology, semantics, semantic similarity, translation, translation equivalent | Japanese, English | Allen, D., & Conklin, K. (2014). Cross-linguistic similarity norms for Japanese–English translation equivalents. Behavior research methods, 46(2), 540-563. |
http://link.springer.com/article/10.3758/s13428-013-0388-0 | Portuguese Free Association norms | Free association norms for 139 Portuguese words from children of various ages | children, language acquisition, word association, free association | Portuguese | Comesaña, M., Fraga, I., Moreira, A. J., Frade, C. S., & Soares, A. P. (2014). Free associate norms for 139 European Portuguese words for children from different age groups. Behavior research methods, 46(2), 564-574. |
http://link.springer.com/article/10.3758/s13428-013-0376-4 | Turkish image norms | Turkish AOA, familiarity and other norms for 260 pictures from 277 native Turkish speakers | familiarity, age of acquisition (AOA), word frequency | Turkish | Raman, I., Raman, E., & Mertan, B. (2014). A standardized set of 260 pictures for Turkish: Norms of name and image agreement, age of acquisition, visual complexity, and conceptual familiarity. Behavior research methods, 46(2), 588-595. |
http://link.springer.com/article/10.3758/s13428-013-0355-9 | Chinese Lexicon project | Reaction times for 2500 single characters and associated lexical norms (frequency, contextual diversity etc.) | contextual diversity, word frequency, lexical decision task, reaction times (RT), response latency | Chinese | Sze, W. P., Liow, S. J. R., & Yap, M. J. (2014). The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters. Behavior research methods, 46(1), 263-273. |
http://link.springer.com/article/10.3758/s13428-013-0358-6 | Dutch action norms | Dutch AOA, word frequency and other norms for 124 line drawings | age of acquisition (AOA), perceptual attributes, action | Dutch | Shao, Z., Roelofs, A., & Meyer, A. S. (2014). Predicting naming latencies for action pictures: Dutch norms. Behavior research methods, 46(1), 274-283. |
http://link.springer.com/article/10.3758/s13428-014-0488-5 | French object norms | Manipulability, graspability and pantomimability norms by French speakers for 560 photographs | iconicity, manipulability, movability, perceptual attributes, graspability, semantics | non-linguistic | Guérard, K., Lagacé, S., & Brodeur, M. B. (2014). Four types of manipulability ratings and naming latencies for a set of 560 photographs of objects. Behavior research methods, 47(2), 443-470. |
http://language.psy.auckland.ac.nz/austronesian/ | Austronesian Basic Vocabulary Database | 210 vocabulary items in almost 1000 Austronesian languages | basic vocabulary, dictionary, Austronesian | Austronesian | Greenhill, S. J., Blust, R., & Gray, R. D. (2008). The Austronesian basic vocabulary database: from bioinformatics to lexomics. Evolutionary bioinformatics online, 4, 271-283. |
http://language.psy.auckland.ac.nz/bantu/ | Bantu Basic Vocabulary Database | 430 vocabulary items from 10 Bantu languages | basic vocabulary, dictionary, Bantu | Bantu | |
http://ielex.mpi.nl/ | Indo-European Lexical Cognacy Database | 207 vocabulary items in 150 Indo-European languages | basic vocabulary, dictionary, Indo-European, cognates | Indo-European | |
http://multitree.org/ | MultiTree: A digital library of language relationships | Resource for language relatedness and genealogy; contains trees for many language families | language family, genealogy, linguistic history, reconstruction, protolanguage | multilingual | |
https://sites.google.com/site/referencelexicon/ | RefLex - Reference Lexicon | Around 60,000 lexical entries for around 500 African languages with phonotactic and cognacy coding | Africa, cognacy, basic vocabulary, dictionary, language history | multilingual | |
http://phonotactics.anu.edu.au/ | ANU World Phonotactics Database | Phonotactic data for over 2000 languages and segmental data for around 4700 languages | typology, phonemes, phoneme inventory, phonology, phonotactics, segments | multilingual | Donohue, M., Hetherington, R., McElvenny, J., & Dawson, V. (2013). World phonotactics database. Department of Linguistics, The Australian National University. |
http://www.worldvaluessurvey.org/wvs.jsp | World Value Survey | Data on socioeconomic and demographic variables, including language background, for over 85,000 respondents in 57 countries | language use, bilingualism, demographic data | multilingual | World Values Survey Association (2009). World Values Survey 1981-2008 Official Aggregate v. 20090901. Madrid: ASEP/JDS. |
http://www.mirjamernestus.nl/Ernestus/NCCFr/ | Nijmegen Corpus of Casual French | 35 hours of orthographically annotated high-quality recordings with 46 French speakers conversing among friends. | phonetics, video, speech, annotated corpus | French | Torreira, F., Adda-Decker, M., & Ernestus, M. (2010). The Nijmegen Corpus of Casual French. Speech Communication, 52, 201-221. |
http://www.cstr.ed.ac.uk/projects/unisyn/ | Unisyn Lexicon | Multi-accent dictionary of English | English dialects, lexicon, accents | English | Fitt, S. (2002). Unisyn lexicon release. The Center for Speech Technology Research, University of Edinburgh. |
http://homepages.inf.ed.ac.uk/korin/sitenew/Research/Combilex/index.html | Combilex speech technology lexicon | Multi-accent dictionary of English | English dialects, lexicon, accents | English | Fitt, S., Richmond, K., & Clark, R. Combilex. |
http://www.mngu0.org/ | mngu0 dataset | Articulatory EMA, MRI, video, audio and 3D scan data from one British male speaker | articulation, articulatory data, MRI, EMA, video, phonetics, speech production | English | Richmond, K. (2011). Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech, pages 1505-1508, Florence, Italy, August 2011.; Steiner, I., Richmond, K., Marshall, I., & Gray, C. D. (2012). The magnetic resonance imaging subset of the mngu0 articulatory corpus. Journal of the Acoustical Society of America, 131(2), 106-111. |
http://link.springer.com/article/10.3758/BF03200831 | Touch-related adjective norms | 306 words that are categorized for various haptic properties such as roughness and weight | perceptual attributes, semantics, feeling | English | Stadtlander, L. M., & Murdoch, L. D. (2000). Frequency of occurrence and rankings for touch-related adjectives. Behavior Research Methods, Instruments, & Computers, 32(4), 579-587. |
http://wordbank.stanford.edu/ | MacArthur-Bates Communicative Development Inventories (MCDI) | Database of children's early vocabulary development and gestures | developmental, language acquisition, early vocabulary, age of acquisition (AOA), gesture, multimodal | English, Danish, Norwegian, Turkish, Spanish, Russian, Mandarin, Swedish, German, Cantonese, Italian, Croatian, Hebrew | Jørgensen, R. N., Dale, P. S., Bleses, D., & Fenson, L. (2010). CLEX: A cross-linguistic lexical norms database. Journal of child language, 37(02), 419-428. |
http://www.iphod.com/ | Irvine Phonotactic Online Dictionary (IPhOD) | Collection of English words and pseudowords coded for a number of phonological variables | phonotactics, biphoneme probability, bigram probability, triphoneme probability, trigram probability, segments, phonemes, syllables | English | Vaden, K.I., Halpin, H.R., Hickok, G.S. (2009). Irvine Phonotactic Online Dictionary, Version 2.0. |
http://st2.ullet.net/ | StressTyp2 | Typological database with stress and accent patterns for 750 languages | typology, stress, accent | multilingual | |
http://phonology.cogsci.udel.edu/dbs/stress/ | UD Phonology Lab Stress Pattern Database | Dominant stress patterns of the world's languages | typology, stress, accent | multilingual | |
http://languagelink.let.uu.nl/anatyp/ | Anaphora Typology Database | Anaphora database with example sentences | typology, anaphora, syntax | multilingual | Dimitriadis, A., Everaert, M., Reinhart, T., & Reuland, E. (2005). Anaphora Typology Database. |
http://languagelink.let.uu.nl/fpps/ | Free Personal Pronoun System database | Personal pronoun systems of the world's languages | typology, syntax, morphology, personal pronoun, morphosyntax | multilingual | |
http://reduplication.uni-graz.at/ | Graz Database on Reduplication | Database that contains reduplication patterns of the world's languages | typology, reduplication, syntax, morphology, morphosyntax | multilingual | Hurch, B. (2005-). Graz Database on Reduplication. http://reduplication.uni-graz.at/ |
http://www.personal.uni-jena.de/~mu65qev/tdir/ | Typological Database of Intensifiers and Reflexives | Database that contains intensifiers and reflexive patterns of the world's languages | typology, intensifier, reflexives, syntax | multilingual | Gast, V., D. Hole, E. König, P. Siemund, S. Töpper (2007). Typological Database of Intensifiers and Reflexives. Version 2.0. http://www.tdir.org. |
http://web.phonetik.uni-frankfurt.de/upsid.html | UCLA Phonological Segment Inventory Data (UPSID) | Contains phonological inventories for 451 languages | phonology, phoneme inventory, typology | multilingual | Maddieson, I. (1984). Patterns of sounds. Cambridge studies in speech science and communication. Cambridge: Cambridge University Press. |
http://apics-online.info/ | Atlas of Pidgin and Creole Language Structures (APiCS) | Grammatical and lexical structures of 75 pidgin and creole languages | phonology, lexicon, negation, syntax, morphology, morphosyntax, typology, pidgin & creole languages, word order | multilingual | Michaelis, S. M., Maurer, P., Haspelmath, M., & Huber, M. (eds.) (2013). Atlas of Pidgin and Creole Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. |
http://valpal.info/ | Valency Patterns Leipzig Online Database (ValPal) | Valency patterns of 36 languages | typology, syntax | multilingual | Hartmann, I., Haspelmath, M., & Taylor, B. (Eds.) (2013). Valency Patterns Leipzig. Leipzig: Max Planck Institute for Evolutionary Anthropology. |
http://ewave-atlas.org/ | Electronic World Atlas of Varieties of English | Over 235 linguistic features mapped for 50 varieties of English | English dialects, phonology, lexicon, morphology, syntax, discourse, word order, tense, aspect | multilingual | Kortmann, B., & Lunkenheimer, K. (Eds.) (2013). The Electronic World Atlas of Varieties of English. Leipzig: Max Planck Institute for Evolutionary Anthropology. |
http://lingweb.eva.mpg.de/numeral/ | Numeral Systems of the World's Languages | Data on numeral systems for about 4000 languages of the world | typology, numeral systems | multilingual | |
http://afbo.info/ | A world-wide survey of affix borrowing | A database of 101 languages where affixes have been borrowed (total of 657 affixes) | typology, affixes, morphology, morphosyntax | multilingual | Seifart, F. (2013). AfBO: A world-wide survey of affix borrowing. Leipzig: Max Planck Institute for Evolutionary Anthropology. |
http://sails.clld.org/ | South American Indigenous Language Structures (SAILS) | A database of 604 linguistic features from 167 South American Indigenous languages | typology, syntax, morphology, morphosyntax, phonology, tense, aspect, evidentiality, word order, agreement | multilingual | Muysken, Pieter, Harald Hammarström, Olga Krasnoukhova, Neele Müller, Joshua Birchall, Simon van de Kerke, Loretta O'Connor, Swintha Danielsen, Rik van Gijn & George Saad. 2014. South American Indigenous Language Structures (SAILS) Online. Leipzig: Online Publication of the Max Planck Institute for Evolutionary Anthropology. (Available at http://sails.clld.org) |
http://www.homophone.com/ | Homophone.com | Informal list of English homophones | homophones, lexicon | multilingual | |
http://intelligencesquaredus.org/ | Intelligence Squared | Political debates with transcripts and votes by audience members | politics, debate, argument, corpus | English | |
http://link.springer.com/article/10.3758/s13428-014-0552-1 | Nencki Affective Word List (NAWL) for Polish | Emotional valence, arousal and imageability ratings for 2,902 Polish words | emotion, valence, arousal, positive, negative, imageability, word frequency, part of speech (POS), word length | Polish | Riegel, M., Wierzba, M., Wypych, M., Żurawski, Ł., Jednoróg, K., Grabowska, A., & Marchewka, A. (2015). Nencki Affective Word List (NAWL): the cultural adaptation of the Berlin Affective Word List–Reloaded (BAWL-R) for Polish. Behavior research methods, 1-15. |
http://link.springer.com/article/10.3758/s13428-011-0059-y | Discrete emotion norms for German (DENN-BAWL) | Discrete emotion ratings for about 2000 German nouns | emotion, valence, arousal, positive, negative | German | Briesemeister, B. B., Kuchinke, L., & Jacobs, A. M. (2011). Discrete emotion norms for nouns: Berlin affective word list (DENN–BAWL). Behavior research methods, 43(2), 441-448. |
http://www.lehoa.macmate.me/MelissaVo/BAWL-R.html | Berlin Affective Word List Reloaded (BAWL-R) | Emotional arousal and valence ratings for about 2900 German nouns | emotion, valence, arousal, positive, negative | German | Võ, M. L., Conrad, M., Kuchinke, L., Urton, K., Hofmann, M. J., & Jacobs, A. M. (2009). The Berlin affective word list reloaded (BAWL-R). Behavior research methods, 41(2), 534-538. |
http://talkbank.org/SLA/ | Second Language Acquisition Resources | Contains transcribed corpora with audio from several languages relevant for second language acquisition research | second language acquisition (SLA), L2, bilingualism | multilingual | |
http://childes.psy.cmu.edu/phon/ | PhonBank Database for Phonological Development | Contains corpora and phonological information on child language development | language development, language acquisition, phonetics, phonology, clinical corpora | English, French, Portuguese, German, Swedish, Dutch, Indonesian, Japanese, Taiwanese, Cantonese, Greek, Arabic, Berber, Romanian, Polish | |
http://childes.psy.cmu.edu/ | Child Language Data Exchange System (CHILDES) | Transcribed and annotated child language corpora for several languages | language development, language acquisition, annotated corpus, corpora, clinical corpora | Celtic, Irish, Welsh, Cantonese, Chinese, Indonesian, Japanese, Korean, Taiwanese, Thai, English, Afrikaans, Dutch, Danish, German, Icelandic, Norwegian, Swedish, Catalan, Spanish, French, Italian, Portuguese, Romanian, Croatian, Polish, Russian, Serbian, Slovenian | |
http://childfreq.sumsar.net/ | ChildFreq: CHILDES frequency tool | Online access tool to CHILDES word frequency data | language development, language acquisition, word frequency | English | Baath, R. (2014). ChildFreq: An online tool to explore word frequencies in child language. |
http://www.opensubtitles.org/en/search | Open Subtitles | Over three million subtitle files for data from several languages | subtitles, corpus, parallel corpus | multilingual | |
http://corpus.quran.com/ | Quranic Arabic Corpus | Morphological annotation, syntactic treebank and semantic ontology for the entire Holy Quran | Quran, corpus, treebank, morphological annotation, syntax, semantics, ontology | Arabic | |
http://www.alc.manchester.ac.uk/subjects/lel/research/projects/archer/ | ARCHER: A Representative Corpus of Historical English Registers | A multi-genre English corpus ranging from 1600 to 1999 | language history, historical corpus, registers | English | |
https://perswww.kuleuven.be/~u0044428/clmet3_0.htm | CLMET: Corpus of Late Modern English Texts | 34 million words of running text from 1710 to 1920 | language history, historical corpus, registers | English | Diller, H., De Smet, H., Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger 19, 21-35. |
http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/ | Helsinki Corpus of English Texts | Contains 1.5 million words from English texts ranging from 730 AD to 1710 AD | Old English, Middle English, Early Modern English, language history, historical corpus | English | |
http://www.helsinki.fi/varieng/CoRD/corpora/HCOS/index.html | Helsinki Corpus of Older Scots (HCOS) | Contains 0.8 million words of Scottish English from 1450 AD to 1700 AD | Scottish, language history, historical corpus | English | The Helsinki Corpus of Older Scots (1995). Department of Modern Languages, University of Helsinki. Compiled by Anneli Meurman-Solin. |
http://www-users.york.ac.uk/~sp20/corpus.html | Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English | Syntactically annotated and POS-tagged corpus of Old English | Old English, language history, historical corpus, syntax | English | |
http://www-users.york.ac.uk/~lang22/YcoeHome1.htm | York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) | A corpus of 1.5 million words of Old English texts, syntactically annotated and POS-tagged | Old English, prose, language history, historical corpus, syntax | English | |
http://www-users.york.ac.uk/~lang18/pcorpus.html | York-Helsinki Parsed Corpus of Old English Poetry | Corpus of Old English poetry, syntactically annotated and POS-tagged | Old English, poetry, language history, historical corpus, syntax | English | |
http://www.ling.upenn.edu/hist-corpora/ | Penn Corpora of Historical English (PPCME2, PPCEME, PPCMBE) | Middle English, Early Modern English and Modern English corpora, syntactically annotated and POS-tagged | Middle English, Early Modern English, language history, historical corpus, syntax | English | |
http://ota.ox.ac.uk/ | The University of Oxford Text Archive (OTA) | Text archives (with some audio and video data) for lots of English texts from many different time periods | historical corpus, text archive, corpora, language history | English | |
http://www.helsinki.fi/varieng/domains/CEEC.html | Corpus of Early English Correspondence (CEEC) | Compiled with historical sociolinguistics in mind, a corpus of more than 6 million words of English correspondence (1410-1800) from thousands of writers | historical corpus, letters, language history, correspondence, sociolinguistics | English | |
https://www.tu-chemnitz.de/phil/english/sections/linguist/real/independent/lampeter/lamphome.htm | The Lampeter Corpus of Early Modern English Texts | English texts from 1640 to 1740 within the categories religion, politics, economy, science, law and miscellaneous | historical corpus, language history, Early Modern English | English | |
http://www.comp.leeds.ac.uk/eric/latifa/research.htm | Corpus of Contemporary Arabic (CCA) | A corpus of 0.8 million Arabic words | corpus | Arabic | |
http://www.thelatinlibrary.com/ | The Latin Library | A collection of Latin texts from several authors | corpus | Latin | |
http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/nowac/index.html | Norwegian Web as Corpus (NoWaC) | A web-based corpus of 700 million Norwegian words | corpus | Norwegian | Guevara, Emiliano Raul (2010). NoWaC: a large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop, Association for Computational Linguistics, 1 - 7. |
http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html | TIGER Corpus | German news corpus of 0.9 million tokens from the Frankfurter Rundschau | corpus, news, media | German | Brants, S., Dipper, S., Eisenberg, P., Hansen, S., König, E., Lezius, W., Rohrer, C., Smith, G., & Uszkoreit, H. (2004). TIGER: Linguistic Interpretation of a German Corpus. Journal of Language and Computation, 2004 (2), 597-620. |
http://www.moderna.uu.se/slaviska/ryska/corpus/ | Uppsala Russian Corpus | Russian corpus of different genres with transliterations | corpus | Russian | |
http://www.arabiclearnercorpus.com/ | Arabic Learner Corpus (ALC) | 0.2 million Arabic words produced by 942 students from 66 different L1 backgrounds | learner corpus, second language acquisition (SLA), bilingualism | Arabic | |
http://www.unicaen.fr/gazette/ | La Gazette de Renaudot | Historical corpus of French gazettes/newspapers | historical corpus, language history | French | |
http://www.engelska.uu.se/Research/English_Language/Research_Areas/Electronic_Resource_Projects/USE-Corpus/?languageId=1 | Uppsala Student English Corpus (USE) | Corpus of 1500 essays written by 440 Swedish university students | learner corpus, second language acquisition (SLA), bilingualism | Swedish | |
http://www.engelska.uu.se/Research/English_Language/Research_Areas/Electronic_Resource_Projects/A_Corpus_of_English_Dialogues/ | Corpus of English Dialogues (CED) | Corpus of 1.1 million words of English dialogues (spoken interactions) from 1560-1760 | historical corpus, dialogue, language history, corpora, Early Modern English | English | |
http://www.gutenberg.org/ | Project Gutenberg | A collection of 50000 free ebooks | book collection, text archive | English | |
http://www.nytimes.com/ref/membercenter/nytarchive.html | New York Times Article Archive | A collection of all New York Times articles from 1851 to the present | text archive, news, media, newspaper | English | |
http://link.springer.com/article/10.3758/BF03195349 | Bird Age of Acquisition and Imageability ratings | Imageability and age of acquisition norms for a set of 2645 English words | age of acquisition (AOA), imageability | English | |
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html | Opinion Lexicon | A list of 6800 positive and negative English opinion words | opinion mining, sentiment analysis, emotional valence, positive, negative | English | |
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html | Amazon Product Review Data | More than 5.8 million reviews of Amazon products | corpus, media, Amazon, reviews | English | |
http://www.yelp.com/dataset_challenge | Yelp Challenge Review Data | About 1.6 million reviews from 360000 Yelp users | corpus, media, Yelp, reviews | English | |
http://sentiwordnet.isti.cnr.it/ | SentiWordNet | A lexical resource for opinion mining | emotional valence, affect, opinion mining, sentiment analysis, wordnet, positive, negative | English | |
http://dialect.topography.chass.utoronto.ca/dt_atlas.php | Atlas of Dialect Topography | Cross-regional dialect topography, largely focused on Canada | sociolinguistics, dialects, Canada, Canadian English, regional variants | English | |
http://austlang.aiatsis.gov.au/disclaimer.php | AUSTLANG: Australian Indigenous Languages Database | Classification and language information on Australian languages, including maps | Australian languages, classification, geography, map, speaker information, language use | multilingual | |
http://www.baydat.uni-wuerzburg.de:8080/cocoon/baydat/baydat | BayDat: Bayerische Dialektdatenbank | Database of Bavarian German dialects | dialects, Bavaria, Germany, sociolinguistics, map | German | |
http://lacito.vjf.cnrs.fr/pangloss/ | La Collection Pangloss | Database of audio materials from several of the world's languages | audio recordings, typology, world's languages | multilingual | |
http://corpus.byu.edu/time/ | Time Magazine Corpus | 100 million word corpus of TIME magazine | corpus, media, news, magazine | English | |
http://corpus.byu.edu/wiki/ | Wikipedia corpus | 1.9 billion word corpus from Wikipedia (4.4 million articles) | corpus, Wikipedia | English | |
http://corpus.byu.edu/glowbe/ | Corpus of Global Web-Based English (GloWbE) | 1.9 billion word corpus from 1.8 million web pages | corpus, web language, web corpus | English | |
http://corpus.byu.edu/can/ | Corpus of Canadian English (STRATHY) | 50 million word corpus of Canadian English ranging from 1920 to 2000 | corpus, Canada, historical corpus, language history | English | |
http://www.corpusdelespanol.org/ | CORPUS DEL ESPAÑOL | 100 million word corpus from 20000 Spanish texts spanning a time range from 1200 to the 1900s | corpus, language history, historical corpora | Spanish | |
http://www.corpusdoportugues.org/ | O CORPUS DO PORTUGUES | 45 million word corpus of Portuguese spanning from 1300 to 1900 | corpus, language history, historical corpora | Portuguese | |
http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php | Digital Corpus of Sanskrit (DCS) | 3.2 million words of Sanskrit with collocates | corpus, language history, historical corpora | Sanskrit | |
http://xtone.linguistics.berkeley.edu/about.php | Cross-Linguistic Tonal Database (Xtone) | Information on lexical tone from 82 different languages | typology, phonology, lexical tone, tonal systems | multilingual | |
http://tls.uni-hd.de/home_en.lasso | Thesaurus Linguae Sericae | A historical and comparative encyclopedia of Chinese conceptual schemes, with corpora and semantic relations | language history, historical corpus, semantic relations, encyclopedia, historical phonology | Chinese | |
http://sealang.net/assam/ | Tai and Tibeto-Burman language corpora | Transcribed and translated texts of Tibeto-Burman languages | Tibeto-Burman, corpus, corpora, endangered languages | Ahom, Aiton, Khamti, Khamyang, Singpho, Turung, Tangsa | |
http://turing.iis.sinica.edu.tw/treesearch/ | Sinica Treebank | Parsed corpus of 360000 Chinese words | treebank, corpus, syntax | Chinese | |
http://romani.humanities.manchester.ac.uk/rms/ | Romani Morpho-Syntax Database (RMS) | Database of linguistic features of Romani | grammatical database, syntax, morphology | Romani | |
http://www.gaois.ie/crp/en/ | Parallel English-Irish corpus of legal texts | 4.5 million English words of legal texts with Irish translations | parallel corpus, corpus, translation, law, legal texts | English, Irish | |
http://www.uni-stuttgart.de/lingrom/stein/corpus/ | Le Nouveau Corpus d'Amsterdam | Old French literary texts between 11th and 14th century | language history, historical corpus, Old French | French | |
http://www.meertens.knaw.nl/nfb/ | Nederlandse Familienamenbank | 300000 Dutch names and their locations in the Netherlands | onomastics, names, geography | Dutch | |
http://www.livac.org/ | LIVAC Synchronous Corpus | 550 million word Chinese corpus | corpus | Chinese | |
http://www.cfilt.iitb.ac.in/indowordnet/ | IndoWordNet | A wordnet of the languages of India | wordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchies | Hindi, Assamese, Bengali, Bodo, Gujarati, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, Telugu, Urdu | |
http://www.cfilt.iitb.ac.in/wordnet/webhwn/ | Hindi WordNet | A wordnet of Hindi | wordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchies | Hindi | |
http://hebrewcorpus.nmelrc.org/ | HebrewCorpus | A 150 million word corpus of Hebrew | corpus | Hebrew | |
http://www.sfs.uni-tuebingen.de/GermaNet/ | GermaNet | A wordnet of German | wordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchies | German | |
http://argyf.fryske-akademy.eu/files/tdb/ | Frisian Languages Database | Frisian database containing audio and written corpora, including historical ones | historical corpus, spoken corpus, audio, language history, Germanic languages | Frisian | |
http://eap.bl.uk/ | Endangered Archives Programme | A text archive of endangered languages | text archive, endangered languages | multilingual | |
http://www.panlex.org/ | PanLex | Vocabulary and translations for 21 million expressions in about 10,000 language varieties, including Swadesh lists for about 2000 language varieties | vocabulary, word list, dictionary, Swadesh list | multilingual | |
http://www.lmp.ucla.edu/ | UCLA Language Materials Project | Contains teaching and learning materials for over 150 less commonly taught languages, including speaker and other information about the languages | speaker information, bilingualism, language use, demographic data | multilingual | |
http://buckeyecorpus.osu.edu/ | The Buckeye Speech Corpus | Corpus of high-quality recordings from 40 speakers in Columbus, Ohio, orthographically transcribed and phonetically labelled | audio corpus, annotated corpus, phonetics, phonology, speech | English | |
http://groups.inf.ed.ac.uk/switchboard/index.html | The Switchboard Corpus in NXT | Updated annotations of the Switchboard corpus of telephone conversations | annotated corpus, prosody, syntax, speech, conversational speech, telephone conversation | English | |
https://catalog.ldc.upenn.edu/LDC96L14 | CELEX2 Corpus | Corpus of English, Dutch and German with additional lexical information | corpus, word frequency | English, Dutch, German | |
https://catalog.ldc.upenn.edu/LDC93S1 | TIMIT Acoustic-Phonetic Continuous Speech Corpus | Audio corpus of 630 speakers of eight American English dialects with time-aligned orthographic, phonetic, and word transcriptions | annotated corpus, speech, phonetics, phonology, audio corpus, English dialects | English | |
http://demeter.inf.ed.ac.uk/cross/publications.html | Twitter FSD First Story Detection Corpus | Corpus of "first stories" (new events) from Twitter | corpus, web language, social media, web corpus | English | |
http://clic.cimec.unitn.it/amac/twitter_ngram/ | Rovereto Twitter N-Gram Corpus | N-grams (up to 6-grams!) for 75 million English tweets | corpus, ngrams, word frequency, web language, social media | English | |
http://trec.nist.gov/data/tweets/ | Tweets2011 corpus | A corpus of tweets collected in January and February 2011 | corpus, web language | English | |
http://demeter.inf.ed.ac.uk/cross/publications.html | Newswire FSD First Story Detection Corpus | A corpus of "first stories" (new events) from newswire | corpus, web language, newspaper, media, political language | English | |
http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica | La Repubblica Corpus | A corpus of 380 million tokens of Italian newspaper texts, POS-tagged, lemmatized and genre categorized | corpus, genre, topic, syntax, part of speech (POS), newspaper, media, political language | Italian | |
http://www.nzilbb.canterbury.ac.nz/onze.shtml | Origins of New Zealand English (ONZE) Corpus | A corpus of various stages of New Zealand English | audio corpus, phonetics, phonology, language history, historical corpus | English | |
http://laslab.org/resources/confusions/ | Corpus of noise-induced Spanish misperceptions/confusions | A corpus of 3235 noise-induced robust misperceptions in Spanish | corpus, phonetics, phonology, speech perception | Spanish | |
http://www.cs.cmu.edu/~mfaruqui/suite.html | WordSim353 evaluation benchmarks | Human similarity ratings for over 3000 word pairs, including syntactic relations | semantics, similarity, semantic relatedness | English, German, French, Arabic, Romanian, Spanish | |
https://github.com/mfaruqui/non-distributional | Non-distributional English word vectors | Large lexicon with thesaurus, antonyms, color, connotations and valence information extracted through NLP procedures | semantics, lexicon, sentiment analysis, affect, emotional valence, antonyms | English | |
https://console.developers.google.com/storage/browser/wikipedia_multilingual_relations_v1/ | Semantic Relations from Wikipedia | A dataset of automatically extracted semantic relations from the multilingual Wikipedia corpus | semantics, semantic relations | French, Russian, Chinese, Arabic, Hindi, Indonesian, Tagalog, Latvian, Swahili, Georgian | |
http://www.nlpado.de/~sebastian/data/tv_data.shtml | Bilingual Formal/Informal Address Corpus | Corpus of English and German sentences from novels tagged for formal and informal connotations, tokenized, lemmatized, POS-tagged | annotated corpus, politeness, formal language | German, English | |
http://www.coli.uni-saarland.de/projects/salsa/corpus/ | German SALSA Corpus | A large frame-based lexicon for German with semantic roles | semantic roles, frames, framenet | German | |
http://www.nlpado.de/~sebastian/data/srl_data.shtml | Cross-lingual projection of semantic roles | Parallel corpora annotated for semantic roles | parallel corpus, corpus, translation, semantic roles | German, English | |
https://framenet.icsi.berkeley.edu/fndrupal/ | FrameNet | A lexical database of English that specifies semantic frames and semantic roles, more than 10000 senses | framenet, lexical database, dictionary, vocabulary, semantic relations | English | |
http://u.cs.biu.ac.il/~nlp/resources/downloads/annotation-of-discourse-references-relevant-for-entailment-inference/ | Discourse Reference Corpus | Pragmatically annotated corpus with information about coreference and bridging | reference, discourse, pragmatics, annotated corpus, entailment inference, coreference | English | |
http://clic.cimec.unitn.it/dm/ | Distributional Memory semantic database | Semantic database of English based on distributional information | lexicon, semantic relatedness, relations, corpus-based semantics, co-occurrence | English | |
http://www.cl.uni-heidelberg.de/~zeller/res/te-ger/index.mhtml | Textual Entailment Search Task Dataset for German | A corpus of 3000 text/hypothesis pairs derived from web forum posts | textual entailment, semantic inference, pragmatics, corpus, web language | German | |
http://www.ims.uni-stuttgart.de/permalink/56cc6c89-c421-11e4-a5e6-000e0c3db68b.html | DErivBase German Derivational Lexicon | A derivational lexicon for German | morphology, lexicon, dictionary, lemma | German | |
http://takelab.fer.hr/data/dmhr/ | Distributional Memory for Croatian | Semantic database of Croatian based on distributional information | lexicon, semantic relatedness, relations, corpus-based semantics | Croatian | |
http://www.cl.cam.ac.uk/~fh295/simlex.html | SimLex999 Semantic Relatedness Dataset | A dataset of normed semantic similarity (rather than just word associations) | semantics, semantic similarity, semantic relatedness, relations, concreteness, word association | English | |
http://www.kuleuven.be/semlab/interface/index.php | Dutch Word Associations | A dataset of word associations in Dutch | semantics, word association | Dutch | |
http://www.nltk.org/nltk_data/ | NLTK Corpora | Variety of corpora and datasets built into the NLTK Python library | natural language processing, python, brown, australian broadcasting, alpino dutch treebank, treebank, CONLL, Europarl, Genesis, bible, gazetteer, C-Span, Gutenberg, KNB corpus, sentiment, NPS chat, opinion lexicon, multilingual wordnet, penn treebank, sentiwordnet | English, Portuguese, Spanish, Basque, Old English, Mandarin Chinese, Polish, Brazilian Portuguese | |
http://www.cstr.ed.ac.uk/research/projects/artic/accor.html | EUR-ACCOR | Cross-language recordings with EPG, laryngograph, nasal airflow, and audio | articulatory phonetics, articulation, speech production, Rothenberg mask | Catalan, English, French, German, Irish Gaelic, Italian, Swedish | |
http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html | MOCHA-TIMIT | Dataset with audio, laryngograph and EMA recordings for English, constructed with the intention of training an automatic speech recognition system | articulation, articulatory phonetics, speech production, electromagnetic articulography, tongue recording | English | |
http://www.u.arizona.edu/~nwarner/WarnerMcQueenCutler.html | English Diphone Perceptual Database | Phoneme categorizations based on a gated listening task | speech perception, phonetics, phonology, psycholinguistics, phonetic information over time | English | |
http://www.mpi.nl/world/dcsp/diphones/ | Dutch Diphone Perceptual Database | A total of 488,520 phoneme categorizations based on a gated listening task of 1,179 Dutch diphones | speech perception, phonetics, phonology, psycholinguistics, phonetic information over time | Dutch | |
http://www.linguateca.pt/acesso/corpus.php?corpus=SAOCARLOS | NILC / San Carlos Corpora | Collection of corpora of contemporary Portuguese, with part of speech tags (POS-tagged) | corpus, annotated corpus | Portuguese | |
http://www.clul.ul.pt/pt/recursos/183-reference-corpus-of-contemporary-portuguese-crpc | CRPC Comparative Portuguese corpus | Large corpus containing texts from several varieties of Portuguese (European, Brazil, Angola, Cape Verde, Guinea-Bissau, Mozambique, Sao Tome and Principe, Goa, Macau, Timor-Leste) | corpus, dialectal corpus, sociolinguistics | Portuguese | |
http://cipm.fcsh.unl.pt/ | CIPM Medieval Portuguese corpus | Historical corpus of medieval Portuguese | historical corpus, language history, classical & medieval Portuguese | Portuguese | |
http://www.letras.ufrj.br/nurc-rj/ | NURC-RJ Spoken Portuguese Corpus | Spoken corpus of Brazilian Portuguese | spoken corpus, audio recordings, phonetics, phonology | Portuguese | |
http://www.letras.ufrj.br/laborhistorico/ | LaborHistorico Historical Portuguese corpus | Official historical corpus of the "A history of Brazilian Portuguese" project | historical corpus, language history | Portuguese | |
https://sites.google.com/site/distributedlittleredhen/home | Distributed Little Red Hen Lab Databases | Resource directory for the UCLA NewsScape Library of International Television News; a TV News Archive that contains news programs | semantics, gesture, phonetics, corpus, television (TV), media, politics, news, multimodal corpus | English | |
http://spokenchinesecorpus.nccu.edu.tw/ | NCCU Corpus of Spoken Chinese | Spoken corpus of Chinese | spoken corpus, audio recordings, phonetics, phonology | Mandarin Chinese | |
http://andosl.anu.edu.au/andosl/ | Australian National Database of Spoken Language (ANDOSL) | Phonetically annotated spoken language corpus of Australian English | spoken corpus, audio recordings, phonetics, phonology, phonetically annotated | Australian English | |
http://projects.ael.uni-tuebingen.de/backbone/moodle/ | BACKBONE Pedagogic Corpus of Video-Recorded Interviews | Spoken interviews with video recordings for several European languages, including second language recordings | spoken corpus, multimodal corpus, video, second language acquisition (SLA), bilingualism | English, French, German, Polish, Spanish, Turkish | |
http://serverdbt.ilc.cnr.it/altweb/ | Atlante Lessicale Toscano (ALT; Lexical Atlas of Tuscany) | Lexical atlas and demographic data; dialectal resource for Tuscan dialects in Italy | sociolinguistics, dialects, lexical atlas, language geography, dialectology, Italian dialects | Italian | |
https://catalog.ldc.upenn.edu/LDC2009T25 | Web 1T 5-gram ngrams for 10 European languages | N-grams (up to 5-grams) and frequency counts for 10 European languages | n-grams, word frequency, Google | Swedish, Spanish, Romanian, Portuguese, Polish, Dutch, Italian, French, German, Czech | |
http://www.let.rug.nl/gosse/bin/Web1T5_freq.perl | Web 1T 5-gram database for Dutch | N-grams and frequency counts for Dutch | n-grams, word frequency, Google | Dutch | |
http://rugtest16.service.rug.nl/gosse/Ngrams/ | Groningen Twitter Corpus | Dutch Twitter corpus containing approximately 2.6 billion tweets and 28 billion tokens collected between January 2014 and December 2014, with n-grams extracted up to 5-grams | n-grams, twitter, web corpus, web language, social media | Dutch | |
http://www.linguistics.ucsb.edu/research/santa-barbara-corpus | Santa Barbara Corpus of Spoken American English (SBCSAE) | 249,000 words with transcriptions, audio and timestamps | spoken corpus, audio recordings, phonetics, phonology, phonetically annotated | English | |
http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/IMS-GECO.en.html | IMS GECO Phonetic Convergence database | 46 dialogs (ca. 25 min long) between female German speakers, in speaker-visible and speaker-invisible contexts for the study of phonetic convergence | spoken corpus, audio recordings, phonetics, phonology, multimodal corpus, phonetic convergence, accommodation, interpersonal synchrony, sociolinguistics, sociophonetics | German | |
http://quod.lib.umich.edu/cgi/c/corpus/corpus?c=micase;page=mbrowse | MICASE Michigan Corpus of Academic Spoken English | 152 transcripts totaling 1.8 million words of academic spoken English | spoken corpus, university language, registers, formal language | English | |
http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm | Blog Authorship corpus | 681,288 posts totaling 140 million words from 19,320 bloggers, collected in 2004, balanced for gender; with age, gender and industry/occupation information | corpus, web language, social media, web corpus, demographic data, sociolinguistics | English | |
https://www.ausnc.org.au/ | AusNC Australian National Corpus | Collection of Australian English corpora (including ACE, ART, AusLit, Braided Channels, COOEE, GCSAusE, ICE Corpus, MD Corpus, Monash Corpus); includes many registers and different time periods and transcribed speech from sociolinguistic interviews with gender information | corpus, Australian English, dialects, spoken corpus, written language, literature, poetry, historical corpus, language history, varieties of English | English | |
http://taiccm.org/ | Taiwan Corpus of Child Mandarin (TCCM) | Taiwan Corpus of Child Mandarin (TCCM) | corpus, child language, language acquisition, L1, children, learner corpus | Chinese | |
http://link.springer.com/article/10.3758/BF03193116 | Wurm 2007 Danger and Usefulness Norms | A published research article that includes ratings for the danger and usefulness of words | semantics, danger, usefulness, semantic norms, meaning, perceptual attributes | English | |
http://link.springer.com/article/10.3758/BRM.40.1.183#page-1 | Semantic Feature Production Norms | Semantic feature production norms for 456 words (objects and events) | semantic features, semantics, feature norms, distinctive features, objects and events, properties | English | |
http://www.neuro.mcw.edu/ratings/ | Wisconsin Perceptual Attribute Ratings Database (MCWisc) | Perceptual attribute norms for four sensory domains: sound, color, manipulation, motion; for 1402 words, including emotion ratings reflecting intensity and valence | perceptual attributes, manipulability, semantics, concepts, perception, valence, affect, feeling | English | |
http://link.springer.com/article/10.3758/BF03195584 | Extension of Paivio norms | Extension of Paivio et al. (1968) lexical norms | gender ladenness, sexual language, stereotypes, age of acquisition (AOA), number of meanings, number of associates, emotionality, pleasantness, emotional valence, children's dictionaries, concreteness, meaningfulness, goodness, word frequency, imagery, imageability, language acquisition, children's word knowledge, lexical knowledge, word knowledge | English | |
http://sumale.vjf.cnrs.fr/pronoms/ | Les marques personnelles dans les langues africaines | Database of personal pronouns of African languages | typology, morphosyntax, morphology, syntax, personal pronouns, person marking | multilingual | |
http://typo.uni-konstanz.de/archive/intro/ | The Konstanz Universals Archive | A list of proposed typological universals | language typology, Greenbergian universals, morphology, syntax, morphosyntax, word order | multilingual | |
http://typo.uni-konstanz.de/rara/intro/index.php | Das grammatische Raritätenkabinett | Informal list of grammatical rarities / typologically rare features | typology, universals, rare features, syntax, morphosyntax | multilingual | |
http://www.soundcomparisons.com | Sound comparisons | Comparative atlas and map with audio samples of Germanic, Romance, Slavic, Celtic, Andean and Mapudungun | language geography, comparative linguistics, dialectology, Indo-European languages, pronunciation, sound patterns, phonetics, phonology, cognates, cognacy | multilingual, including English, German, French, Italian, Spanish and Portuguese | |
http://langscape.umd.edu/ | Langscape | World map / linguistic atlas showing the location of languages and visualizing linguistic diversity across the globe | language geography, linguistic diversity, typology, map, endangered languages, atlas | multilingual | |
http://sswl.railsplayground.net/ | SSWL Syntactic Structures of the World's Languages | Typological database with syntactic features for 250+ languages of the world | typology, morphology, morphosyntax, syntax, word order, universals | multilingual | |
http://sealang.net/monkhmer/dictionary/ | SEAlang Mon-Khmer Etymological Dictionary | Dictionary for comparative and historical linguistics of Mon-Khmer languages | etymology, dictionary, lexical data, language history, historical linguistics, phylogenetics, comparative dictionaries, Asian languages | multilingual | |
http://pollex.org.nz/ | Polynesian Lexicon Project (Pollex Online) | Large-scale comparative dictionary of Polynesian languages | Polynesian, Austronesian, lexical data, comparative dictionary, cognacy, cognates, historical linguistics, Pacific languages, word lists | multilingual | |
http://transnewguinea.org/ | TransNewGuinea.org | Database of languages from the Trans-New Guinea family and friends, encompassing 900+ languages and info on 1000+ words | Pacific languages, Trans-New Guinea family (TNG), Papua New Guinea (PNG), language history, historical linguistics, linguistic diversity, Austronesian | multilingual | |
http://starling.rinet.ru/new100/main.htm | The Global Lexicostatistical Database (GLD) | Basic word lists for many of the world's languages | comparative linguistics, historical linguistics, phylogenetics, lexicostatistics, basic vocabulary, word lists, Swadesh list | multilingual | |
http://www.lapsyd.ddl.ish-lyon.cnrs.fr/ | Lyon-Albuquerque Phonological Systems Database (LAPSyD) | Searchable database of basic phonological information on a wide sample of the world's languages | phonological typology, phoneme inventory, phonology, phonetics, consonants, vowels, syllable structure, linguistic stress, lexical tones | multilingual | |
https://doi.org/10.3758/s13428-018-1099-3 | The Glasgow Norms | Normative ratings for 5,553 English words on nine psycholinguistic dimensions: arousal, valence, dominance, concreteness, imageability, familiarity, age of acquisition, semantic size, and gender association | | English | Scott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S. C. (2018). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51, 1258-1270. |
https://www.cogsci.mq.edu.au/research/resources/nwdb/nwdb.html | ARC Nonword Database | 358,534 nonwords | | | Rastle, K., Harrington, J., & Coltheart, M. (2002). 358,534 nonwords: The ARC Nonword Database. Quarterly Journal of Experimental Psychology, 55A, 1339-1362. |
https://lingualab.ca/en/project/norms-familiarity-perceptual-strength | French Canadian conceptual familiarity norms | 3,596 nouns and online data about them from 313 Canadian French speakers | | French | |
https://smallworldofwords.org/en/project | Small World of Words | Word association and participant data for 100 primary, secondary and tertiary responses to 12,292 cues in English and 12,571 cues in Dutch | | English, Dutch | |