Linguistics Stimuli

The following is forked with permission from (and almost identical to) The Language Goldmine.

URLTitleDescriptionTagsLanguagesAssociated Publication
http://concepticon.clld.org/ConcepticonLinks 9611 concepts from 51 different concept lists to 2206 different concept sets, 243 relations between concepts are definedsemantics, concepts, lexicon structure, vocabularymultilingual
http://clics.lingpy.org/Database of Cross-Linguistic ColexificationsGives polysemy information for 221 different languages covering 64 families (more than 300000 words and 10000 concepts)semantics, concpts, polysemy, lexicon structure, vocabulary, typologymultilingualList, J.-M., Terhalle, A., & Urban, M. (2013). Using network approaches to enhance the analysis of cross-linguistic polysemies. Proceedings of the 10th International Conference on Computational Semantics (pp. 347-353). Association for Computational Linguistics.
https://archive.org/details/tvThe TV News ArchiveContains more than 705,000 captioned and searchable news programs from over 4 years of U.S. television networkssemantics, gesture, phonetics, corpus, TV, media, politics, news, multimodal corpusEnglish, Spanish
http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nlSUBTLEX-NLDutch word frequencies based on 44 million words from film and television subtitlesword frequency, contextual diversityDutchKeuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42(3), 643-650.
http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-chSUBTLEX-CHChinese word frequencies based on 33.5 million words from film and television subtitlesword frequency, part of speech (POS), lexical decision task, reaction times (RT), response latencyChineseCai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One, 5(6), e10729.
http://www.bcbl.eu/databases/subtlex-gr/SUBTLEX-GRModern Greek word frequencies based on 23 million words from film and television subtitlesword frequency, orthographic neighborhood density, orthgraphic levensthein distance, contextual diversityGreekDimitropoulou, M., Duñabeitia, J., Avilés, A., Corral, J.& Carreiras, M. (2010). Subtitle-based word frequencies as the best estimate of reading behaviour: the case of Greek.Frontiers in Psychology, 1:218, 1-12.
http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-plSUBTLEX-PLPolish word frequencies based on 101 million words from film and television subtitlesword frequencyPolishMandera, P., Keuleers, E., Wodniecka, Z., & Brysbaert, M. (2014). Subtlex-pl: subtitle-based word frequency estimates for Polish. Behavior research methods, 47(2), 471-483.
http://crr.ugent.be/archives/1423SUBTLEX-UKBritish English word frequencies based on 201.3 million words from 45,099 BBC broadcastsword frequency, contexutal diversity, word frequency in childrens programs, part of speech (POS), bigram frequenciesEnglishVan Heuven, W.J.B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). Subtlex-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190.
http://crr.ugent.be/archives/534SUBTLEX-DEGerman word frequencies of 25.4 million words from film and television subtitlesword frequencyGermanBrysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German. Experimental Psychology, 58, 412-424
http://www.dlexdb.de/Digitales Woerterbuch der deutschen Sprache (dlexDB)Over 100 million German word tokens, neighborhood densities and bigram and trigram probabilities based on different registersword frequency, bigram probability, trigram probability, neighborhood density, conditional probabilityGermanHeister, J., Wuerzner, K. M., Bubenzer, J., Pohl, E., Hanneforth, T., Geyken, A., & Kliegl, R. (2011). dlexDB-A lexical database for the psychological and linguistic research. Psychologische Rundschau, 62(1), 10-20.
http://wortschatz.uni-leipzig.de/Leipzig Wortschatz LexiconGerman thesaurus and lexical networkthesaurus, lexical networkGerman
http://corpora2.informatik.uni-leipzig.de/Leipzig Corpus CollectionContains frequencies and co-occurrence information for 219 languagesword frequency, corpusmultilingualQuasthoff, U., Richter, M., Biemann, C. (2006). Corpus Portal for Search in Monolingual Corpora. Proceedings of the fifth international conference on Language Resources and Evaluation, LREC 2006, Genoa, pp. 1799-1802.
http://crr.ugent.be/archives/806Kuperman English Age-of-acquisition ratingsAge-of-acquisition ratings for 30,000 English words.age of acquisition (AOA)EnglishKuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978-990.
http://crr.ugent.be/archives/1003Warriner English Affective RatingsValence, arousal and dominance ratings for 13,915 English wordsemotion, valence, dominance, arousal, affect, positive, negativeEnglishWarriner, A.B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45, 1191-1207.
http://crr.ugent.be/archives/1330Brysbaert English Concreteness RatingsConcreteness ratings for 40,000 English wordsconcretenessEnglishBrysbaert, M., Warriner, A.B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904-911.
http://crr.ugent.be/archives/1602Brysbaert Dutch Age-of-acquisition & Concreteness ratingsAge-of-acquisition and concreteness ratings for 30,000 Dutch wordsconcreteness, age of acquisition (AOA)DutchBrysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80-84.
http://crr.ugent.be/archives/878Moors Dutch Affective RatingsValence, arousal and dominance ratings for 4,300 Dutch wordsemotion, valence, dominance, arousal, affect, positive, negativeDutchMoors, A., De Houwer, J., Hermans, D., Wanmaker, S., van Schie, K., Van Harmelen, A. L., De Schryver, M., De Winne, J., & Brysbaert, M. (2013). Norms of valence, arousal, dominance, and age of acquisition for 4,300 Dutch words. Behavior research methods, 45(1), 169-177.
https://sites.google.com/site/frenchlexicon/resultsFrench Lexicon ProjectLexical decision data for 38,840 French words and 38,840 pseudowordslexicon project, psycholinguistic database, reaction times (RT), response latency, word frequencyFrenchFerrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., MĂ©ot, A., Augustinova, M., & Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42, 488-496.
http://www.lexique.org/LexiqueFrench lexical database for 135, 000 wordslexicon project, psycholinguistic database, reaction times (RT), response latency, word frequencyFrench
http://link.springer.com/article/10.3758%2FBRM.42.4.992Malay Lexicon ProjectMalay lexical database for 9,592 wordslexicon project, psycholinguistic database, reaction times (RT), response latency, word frequencyMalayYap, M. J., Liow, S. J. R., Jalil, S. B., & Faizal, S. S. B. (2010). The Malay Lexicon Project: A database of lexical statistics for 9,592 words. Behavior research methods, 42(4), 992-1003.
http://elexicon.wustl.edu/English Lexicon Project (ELP)English lexical database for 40,481 wordslexicon project, psycholinguistic database, reaction times (RT), response latency, lexical decision task, word naming, contextual diversity, neighborhood density, bigram probability, part of speech (POS), levenshtein distance, SUBTLEXEnglishBalota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English lexicon project. Behavior research methods, 39(3), 445-459.
http://journal.frontiersin.org/article/10.3389/fpsyg.2010.00174/abstractDutch Lexicon ProjectDutch lexical database for 14,000 wordslexicon project, psycholinguistic database, reaction times (RT), response latency, word frequencyDutchKeuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono-and disyllabic words and nonwords. Frontiers in Psychology, 1, 174.
http://crr.ugent.be/programs-data/lexicon-projectsBritish Lexicon Project (BLP)British English lexical database for 28,000 wordslexicon project, psycholinguistic database, reaction times (RT), response latency, bigram probability, trigram probability, word frequencyEnglishKeuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287-304.
http://crr.ugent.be/programs-data/word-prevalence-valuesDutch Word Knowledge & PrevalenceWord prevalence values for 54,319 Dutch words from nearly 300,000 participantsword prevalence, word knowledge, lexical knowledgeDutchKeuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. The Quarterly Journal of Experimental Psychology, (ahead-of-print), 1-28.
http://semshifts.iling-ran.ru/Database of Semantic Shifts in the Languages of the World3,690 semantic connections in the world's languages (polysemy, semantic changes)polysemy, semantic change, semanticsmultilingualZalizniak, A. A., Bulakh, M., Ganenkov, D., Gruntov, I., Maisak, T., & Russo, M. (2012). The catalogue of semantic shifts as a database for lexical semantic typology. Linguistics, 50, 633-669.
http://w3.usf.edu/FreeAssociation/USF Word Association NormsFree word association data for 72,000 word pairsword association, semanticsEnglishNelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36, 402-407.
http://link.springer.com/article/10.3758/BRM.41.2.558Modality Exclusivity Norms for AdjectivesHow much 423 adjectives are associated with different modalities/senses (e.g., vision, hearing)modality exclusivity, senses, semantics, adjectivesEnglishLynott, D., & Connell, L. (2009). Modality exclusivity norms for 423 object properties. Behavior Research Methods, 41, 558-564.
http://www.psych.rl.ac.uk/MRC Psycholinguistics DatabaseLexical database for Englishlexicon project, psycholinguistic database, reaction times (RT), response latency, word frequency, concreteness, familiarity, imageability, meaningfulness, part of speech (POS), lexical category, part of speech, number of phonemes, syllables, letters, stress-marked phonetic transcriptionEnglishColtheart, M. (1981). The MRC Psycholinguistic Database. Quarterly Journal of Experimental Psychology, 33A, 497-505.; Wilson, M.D. (1988). The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20, 6-11.
http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/overview.htmSUBTLEX-USFrequencies from 51 million word tokensword frequency, contextual diversityEnglishBrysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.; Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior research methods, 44(4), 991-997.
http://www.natcorp.ox.ac.uk/British National Corpus (BNC)Corpus based with 100 million wordscorpusEnglish
http://wals.info/World Atlas of Language Structures (WALS)Typological databasetypology, grammatical database, syntax, morphology, phonologymultilingualDryer, Matthew S. & Haspelmath, Martin (eds.) (2013). The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://wold.clld.org/The World Loanword Database (WOLD)Loanword database with mini-dictionaries for 41 languages; words are coded for likelihood of being a loanwordloanwords, borrowing, language contactmultilingualHaspelmath, M., & Tadmor, U. (Eds.). (2009). Loanwords in the world's languages: a comparative handbook. Walter de Gruyter.
https://sites.google.com/site/kenmcraelab/norms-dataSemantic Feature NormsSemantic feature norms for 541 concepts from 725 participantssemantic features, semantics, feature norms, distinctive features, objects and events, propertiesEnglishMcRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior research methods, 37(4), 547-559.
https://books.google.com/ngramsGoogle NgramLarge corpora of books with word frequencies and ngram frequencies from English, German, French, Italian, Spanish, Russian, Chinese and Hebrew, POS-taggedngrams, word frequencymultilingualMichel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., ... & Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized books. science, 331(6014), 176-182.
http://wordnet.princeton.edu/WordNetMachine-readable dictionary with semantic relations for Englishwordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchiesEnglishMiller GA. WordNet: a lexical database for English. Communications of the ACM. 1995;38:39-41.; Fellbaum C. WordNet: An Electronic Lexical Database. MIT Press; 1998.
http://www.eat.rl.ac.uk/Edinburgh Associative ThesaurusEnglish word association normsword association, semanticsEnglishKiss, G.R., Armstrong, C., Milroy, R., and Piper, J. (1973) An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W. and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: University Press.
http://www.illc.uva.nl/EuroWordNet/Euro WordNetWordnets for several European languageswordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchiesmultilingual
http://corpus.byu.edu/coca/Corpus of Contemporary American English (COCA)440 million word corpus of contemporary American EnglishcorpusEnglish
http://corpus.byu.edu/coha/Corpus of Historical American English (COHA)385 million word corpus of historical American EnglishcorpusEnglishDavies, M. (2011). The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English. Literary and Linguistic Computing 25: 447-65.
http://lingweb.eva.mpg.de/ids/Intercontinental Dictionary Series (IDS)Comparative lexical databaselexicon, dictionary, vocabularymultilingual
http://phoible.org/PHOIBLE: The world's largest database of phonological inventoriesCross-linguistic phoneme inventory datatypology, phonemes, phonetics, phonology phoneme inventorymultilingualMoran, S., & McCloy, D., & Wright, R. (eds.) 2014. PHOIBLE Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://137.122.133.199/~Jeff/pbase/index.htmlP-baseDatabase of several thousand sound patterns in 500+ languagesphonetics, phonology, typologymultilingual
http://137.122.133.199/~Jeff/phonetic_similarity/index.htmlPhonetic Similarity databaseSimilarity ratings for 51 segmentsphonetics, phonologymultilingual
http://www.autotyp.uzh.ch/AUTOTYPTypological databaselanguage family, language area, genealogy, geography, genealogical data, typology, language historymultilingualBickel, B. & J. Nichols, 2002. Autotypologizing databases and their use in fieldwork. In Proc. Int. LREC Workshop on Resources and Tools in Field Linguistics. Las Palmas, 25-26 May 2002.
http://semanticweb.kaist.ac.kr/home/index.php/KAIST_CorpusKAIST Korean Corpus70 million Korean eojeol corpuscorpusKorean
http://www.statmt.org/europarl/Europarl Parallel Corpus (EPC)Parallel text with up to 60 million words for 20 languagsparallel corpus, corpus, translationmultilingual
http://paralleltext.info/data/Parallel Bible Corpus (PBC)Parallel corpus of the bible with around 900 languages from around 80 different language familiesparallel corpus, corpus, translationmultilingual
http://unicode.org/udhr/Universal Declaration of Human RightsParallel corpus of the declaration of human rights for 400 languagesparallel corpus, corpus, translationmultilingual
http://wacky.sslmit.unibo.it/doku.php?id=corporaWacky CorpusSyntactically annotated or POS-tagged corpora with up to 2 billion words for English, French, German and Italian, also includes Italian Wikipedia corpuscorpus, syntax, treebank, part of speech (POS), WikipediaEnglish, French, German, Italian
http://asjp.clld.org/Automated Similarity Judgment Project (ASJP)Word lists of around 6000 languagesword list, typology, vocabularymultilingualWichmann, SĂžren, AndrĂ© MĂŒller, Annkathrin Wett, Viveka Velupillai, Julia Bischoffberger, Cecil H. Brown, Eric W. Holman, Sebastian Sauppe, Zarina Molochieva, Pamela Brown, Harald Hammarström, Oleg Belyaev, Johann-Mattis List, Dik Bakker, Dmitry Egorov, Matthias Urban, Robert Mailhammer, Agustina Carrizo, Matthew S. Dryer, Evgenia Korovina, David Beck, Helen Geyer, Patience Epps, Anthony Grant, and Pilar Valenzuela. 2013. The ASJP Database.
https://www.ethnologue.com/EthnologueComprehensive catalogue of the world's languagescatalogue, typologymultilingual
http://glottolog.org/GlottologComprehensive reference information for the world's languagescatalogue, typologymultilingual
http://opus.lingfil.uu.se/Open parallel corpusA collection of parallel corpora including 71 million sentences for about 30 languagesparallel corpus, corpus, translationmultilingual
http://archive.org/browse.php?field=subject&mediatype=texts&collection=rosettaprojectRosetta Collection in The Internet ArchiveMedia files and documents about the languages from the world collected by the Rosetta foundationreference grammar, word listmultilingual
http://data.worldbank.org/World Bank Open DataDemographic and geographical data on the world's countriesdemographic data, country, migration, bilingualism, language usenon-linguistic
http://link.springer.com/article/10.3758/s13428-012-0267-0Modality Exclusivity Norms for NounsModality norms from for 400 English nounsperceptual attributes, modality exclusivity, senses, semantics, vision, sight, hearing, touch, taste, smellEnglishLynott, D., & Connell, L. (2013). Modality exclusivity norms for 400 nouns: The relationship between perceptual experience and surface word form. Behavior research methods, 45(2), 516-526.
http://link.springer.com/article/10.3758/s13428-010-0038-8Modality Exclusivity Norms for AdjectivesModality norms from 400 American English participants for 387 adjectivesperceptual attributes, modality exclusivity, senses, semantics, vision, sight, hearing, touch, taste, smellEnglishvan Dantzig, S., Cowell, R. A., Zeelenberg, R., & Pecher, D. (2011). A sharp image or a sharp knife: Norms for the modality-exclusivity of 774 concept-property items. Behavior Research Methods, 43(1), 145-154.
http://link.springer.com/article/10.3758/s13428-012-0215-zPerceptual and motor attribute ratingsPerceptual and motor attribute ratings for 559 concepts based on 376 American English participantsgraspability, perceptual attributes, semanticsEnglishAmsel, B. D., Urbach, T. P., & Kutas, M. (2012). Perceptual and motor attribute ratings for 559 object concepts. Behavior research methods, 44(4), 1028-1041.
http://link.springer.com/article/10.3758/s13428-012-0242-9Sensory experience ratingsSensory experience ratings for 5857 English words based on 63 participantsperceptual attributes, semanticsEnglishJuhasz, B. J., & Yap, M. J. (2013). Sensory experience ratings for over 5,000 mono-and disyllabic words. Behavior research methods, 45(1), 160-168.
http://link.springer.com/article/10.3758/s13428-014-0488-5Manipulability and naming norms for photographsManipulability ratings and naming RT norms for photographsperceptual attributes, manipulabilitynon-linguistic
http://link.springer.com/article/10.3758/BRM.42.1.82Manipulability, familiarity and AOA for photographsManipulability, familiarity and AOA for photographsperceptual attributes, manipulability, age of acquisition (AOA), familiaritynon-linguisticSalmon, J. P., McMullen, P. A., & Filliter, J. H. (2010). Norms for two types of manipulability (graspability and functional usage), familiarity, and age of acquisition for 320 photographs of objects. Behavior Research Methods, 42(1), 82-95.
http://www.tandfonline.com/doi/abs/10.1080/13825585.2010.540849Spanish norms for photographs140 color images that have been normed by 106 Spanish speakers on age of acquisition, familiarity, manipulability and other measuresage of acquisition (AOA), perceptual attributes, manipulabilitySpanishMoreno-MartĂ­nez, F. J., Montoro, P. R., & Laws, K. R. (2011). A set of high quality colour images with Spanish norms for seven relevant psycholinguistic variables: The Nombela naming test. Aging, Neuropsychology, and Cognition, 18(3), 293-327.
http://link.springer.com/article/10.3758/s13428-014-0466-yFrench acronym normsPsycholinguistic norms for French acronymsacronyms, reading time (RT), age of acquisition ratings (AOA), subjective frequency, imageabilityFrenchBonin, P., MĂ©ot, A., Millotte, S., & Bugaiska, A. (2014). Norms and reading times for acronyms in French. Behavior research methods, 47(1), 251-267.
http://link.springer.com/article/10.3758/s13428-014-0454-2Spanish AOA normsSubjective age-of-acquisition norms for 7,039 Spanish wordsage of acquisition (AOA)SpanishAlonso, M. A., Fernandez, A., & DĂ­ez, E. (2015). Subjective age-of-acquisition norms for 7,039 Spanish words. Behavior research methods, 47(1), 268-274.
http://link.springer.com/article/10.3758/s13428-014-0467-xPersian emotional speechEmotional speech from 470 sentences normed by 1,126 Persian native speakersemotion, emotional speechPersianKeshtiari, N., Kuhlmann, M., Eslami, M., & Klann-Delius, G. (2015). Recognizing emotional speech in Persian: A validated database of Persian emotional speech (Persian ESD). Behavior research methods, 47(1), 275-294.
http://www.csl.psychol.cam.ac.uk/propertynorms/CSLB Property NormsFeature norms for 866 concrete concepts by 123 native speakers of British Englishsemantic features, semanticsEnglishDevereux, B. J., Tyler, L. K., Geertzen, J., & Randall, B. (2014). The Centre for Speech, Language and the Brain (CSLB) concept property norms. Behavior research methods, 46(4), 1119-1127.
http://link.springer.com/article/10.3758/s13428-013-0426-yANGST German affectiveness ratingsValence, arousal, dominance and other ratings for 1,003 German wordsemotion, valence, dominance, arousal, affect, positive, negativeGermanSchmidtke, D. S., Schröder, T., Jacobs, A. M., & Conrad, M. (2014). ANGST: Affective norms for German sentiment terms, derived from the affective norms for English words. Behavior Research Methods, 46(4), 1108-1118.
http://link.springer.com/article/10.3758/s13428-013-0431-1French affective normsAffective norms for 1,031 French words by 469 French speakersemotion, valence, dominance, arousal, affect, positive, negativeFrenchMonnier, C., & Syssau, A. (2014). Affective norms for French words (FAN). Behavior research methods, 46(4), 1128-1137.
http://link.springer.com/article/10.3758/s13428-013-0409-zGender stereotypicality normsGender stereotypicality norms for role nouns in 7 European languagesgender, stereotypes, stereotypicality, sociolinguisticsCzech, English, French, German, Italian, Norwegian, SlovakMisersky, J., Gygax, P. M., Canal, P., Gabriel, U., Garnham, A., Braun, F., ... & Sczesny, S. (2014). Norms on the gender perception of role nouns in Czech, English, French, German, Italian, Norwegian, and Slovak. Behavior research methods, 46(3), 841-871.
http://link.springer.com/article/10.3758/s13428-013-0400-8PhonItaliaPhonological representatons for 120,000 Italian word formsphonological representation, phonology, transcriptionItalianGoslin, J., Galluzzi, C., & Romani, C. (2014). PhonItalia: a phonological lexicon for Italian. Behavior research methods, 46(3), 872-886.
http://link.springer.com/article/10.3758/s13428-013-0405-3Italian affective normsAffective norms for 1,121 Italian wordsemotion, valence, dominance, arousal, affect, positive, negativeItalianMontefinese, M., Ambrosini, E., Fairfield, B., & Mammarella, N. (2014). The adaptation of the affective norms for english words (ANEW) for Italian. Behavior research methods, 46(3), 887-903.
http://link.springer.com/article/10.3758/s13428-013-0370-xSubjetive ASL frequencySubjective frequency ratings for 432 ASL signs from 59 native deaf signerssubjective frequency, familiarity, American Sign Language (ASL)ASLMayberry, R. I., Hall, M. L., & Zvaigzne, M. (2014). Subjective frequency ratings for 432 ASL signs._Behavior research methods,_46(2), 526-539.
http://link.springer.com/article/10.3758/s13428-013-0389-zJapanese-English similarity for translation equivalents193 Japanese-English word pairs are rated for phonological and semantic similarityphonological similarity, phonetics, phonology, semantics, semantic similiarity, translation, translation equivalentJapanese;EnglishAllen, D., & Conklin, K. (2014). Cross-linguistic similarity norms for Japanese–English translation equivalents. Behavior research methods, 46(2), 540-563.
http://link.springer.com/article/10.3758/s13428-013-0388-0Portuguese Free Association normsFree association norms for 139 Portuguse words from children of various ageschildren, language acquisition, word association, free associationPortugueseComesaña, M., Fraga, I., Moreira, A. J., Frade, C. S., & Soares, A. P. (2014). Free associate norms for 139 European Portuguese words for children from different age groups. Behavior research methods, 46(2), 564-574.
http://link.springer.com/article/10.3758/s13428-013-0376-4Turkish image normsTurkish AOA, familiarity and other norms for 260 pictures from 277 native Turkish speakersfamiliarity, age of acquisition (AOA), word frequencyTurkishRaman, I., Raman, E., & Mertan, B. (2014). A standardized set of 260 pictures for Turkish: Norms of name and image agreement, age of acquisition, visual complexity, and conceptual familiarity. Behavior research methods, 46(2), 588-595.
http://link.springer.com/article/10.3758/s13428-013-0355-9Chinese Lexicon projectReaction times for 2500 single characters and associated lexical norms (frequency, contextual diversity etc.)contextual diversity, word frequency, lexical decision task, reaction times (RT), response latencyChineseSze, W. P., Liow, S. J. R., & Yap, M. J. (2014). The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters. Behavior research methods, 46(1), 263-273.
http://link.springer.com/article/10.3758/s13428-013-0358-6Dutch action normsDutch AOA, word frequency and other norms for 124 line drawingsage of acquisition (AOA), perceptual attributes, actionDutchShao, Z., Roelofs, A., & Meyer, A. S. (2014). Predicting naming latencies for action pictures: Dutch norms. Behavior research methods, 46(1), 274-283.
http://link.springer.com/article/10.3758/s13428-014-0488-5French object normsManipulability, graspability and pantomimability norms by French speakers for 560 photographsiconicity, manipulability, movability, perceptual attributes, graspability, semanticsnon-linguisticGuérard, K., Lagacé, S., & Brodeur, M. B. (2014). Four types of manipulability ratings and naming latencies for a set of 560 photographs of objects. Behavior research methods, 47(2), 443-470.
http://language.psy.auckland.ac.nz/austronesian/Austronesian Basic Vocabulary Database210 vocabulary items in almost 1000 Austronesian languagesbasic vocabulary, dictionary, AustronesianAustronesianGreenhill, S. J., Blust, R., & Gray, R. D. (2008). The Austronesian basic vocabulary database: from bioinformatics to lexomics. Evolutionary bioinformatics online, 4, 271-283.
http://language.psy.auckland.ac.nz/bantu/Bantu Basic Vocabulary Database430 vocabulary items from 10 Bantu languagesbasic vocabulary, dictionary, BantuBantu
http://ielex.mpi.nl/Indo-European Lexical Cognacy Database207 vocabulary items in 150 Indo-European languagesbasic vocabulary, dictionary, Indo-European, cognatesIndo-European
http://multitree.org/MultiTree: A digital library of language relationshipsResource for language relatedness and genealogy; contains trees for many language familieslanguage family, genealogy, linguistic history, reconstruction, protolanguagemultilingual
https://sites.google.com/site/referencelexicon/RefLex - Reference LexiconAround 60,000 lexical entries for around 500 African languages with phonotactic and cognacy codingAfrica, cognacy, basic vocabulary, dictionary, language historymultilingual
http://phonotactics.anu.edu.au/ANU World Phonotactics DatabasePhonotactic data for over 2000 languages and segmental data for around 4700 languagestypology, phonemes, phoneme inventory, phonology, phonotactics, segmentsmultilingualDonohue, M., Hetherington, R., McElvenny, J., & Dawson, V. (2013). World phonotactics database. Department of Linguistics, The Australian National University.
http://www.worldvaluessurvey.org/wvs.jspWorld Value SurveyData on socioeconomic and demographic variables, including language background, for over 85,000 respondents in 57 countrieslanguage use, bilingualism, demographic datamultilingualWorld Values Survey Association (2009). World Values Survey 1981-2008 Official Aggregate v. 20090901. Madrid: ASEP/JDS.
http://www.mirjamernestus.nl/Ernestus/NCCFr/Nijmegen Corpus of Casual French35 hours of orthographically annoted high-quality recordings with 46 French speakers conversing among friends.phonetics, video, speech, annotated corpusFrenchTorreira, F., Adda-Decker, M., & Ernestus, M. (2010). The Nijmegen Corpus of Casual French. Speech Communication, 52, 201-221.
http://www.cstr.ed.ac.uk/projects/unisyn/Unisyn LexiconMulti-accent dictionary of EnglishEnglish dialects, lexicon, accentsEnglishFitt, S. (2002). Unisyn lexicon release. The Center for Speech Technology Research, University of Edinburgh.
http://homepages.inf.ed.ac.uk/korin/sitenew/Research/Combilex/index.htmlCombilex speech technology lexiconMulti-accent dictionary of EnglishEnglish dialects, lexicon, accentsEnglishFitt, S., & Richmond, K., & Clark, R. Combilex.
http://www.mngu0.org/mngu0 datasetArticulatory EMA, MRI, video, audio and 3D scan data frome one British male speakerarticulation, articulatory data, MRI, EMA, video, phonetics, speech productionEnglishRichmond, K. (2011). Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In Proc. Interspeech, pages 1505-1508, Florence, Italy, August 2011.; Steiner, I., Richmond, K., Marshall, I., & Gray, C. D. (2012). The magnetic resonance imaging subset of the mngu0 articulatory corpus. Journal of the Acoustical Society of America, 131(2), 106-111.
http://link.springer.com/article/10.3758/BF03200831Touch-related adjective norms306 words that are categorized for various haptic properties such as roughness and weightperceptual attributes, semantics, feelingEnglishStadtlander, L. M., & Murdoch, L. D. (2000). Frequency of occurrence and rankings for touch-related adjectives. Behavior Research Methods, Instruments, & Computers, 32(4), 579-587.
http://wordbank.stanford.edu/MacArthur-Bates Communicative Devleopment Inventories (MCDI)Database of children's early vocabulary development and gesturesdevelopmental, language acquisition, early vocabulary, age of acquisition (AOA), gesture, multimodalEnglish, Danish, Norwegian, Turkish, Spanish, Russian, Mandarin, Swedish, German, Cantonese, Italian, Croatian, HebrewJĂžrgensen, R. N., Dale, P. S., Bleses, D., & Fenson, L. (2010). CLEX: A cross-linguistic lexical norms database. Journal of child language, 37(02), 419-428.
http://www.iphod.com/Irvine Phonotactic Online Dictionary (IPhOD)Collection of English words and pseudowords with respect to number of phonological variablesphonotactics, biphoneme probability, bigram probability, triphoneme probability, trigram probability, segments, phonemes, syllablesEnglishVaden, K.I., Halpin, H.R., Hickok, G.S. (2009). Irvine Phonotactic Online Dictionary, Version 2.0.
http://st2.ullet.net/StressTyp2Typological database with stress and accent patterns for 750 languagestypology, stress, accentmultilingual
http://phonology.cogsci.udel.edu/dbs/stress/UD Phonology Lab Stress Pattern DatabaseDominant stress patterns of the world's languagestypology, stress, accentmultilingual
http://languagelink.let.uu.nl/anatyp/Anaphora Typology DatabaseAnaphora database with example sentencestypology, anaphora, syntaxmultilingualDimitriadis, A., Everaert, M., Reinhart, T., & Reuland, E. (2005). Anaphora Typology Database.
http://languagelink.let.uu.nl/fpps/Free Personal Pronoun System databasePersonal pronoun systems of the world's languagestypology, syntax, morphology, personal pronoun, morphosyntaxmultilingual
http://reduplication.uni-graz.at/Graz Database on ReduplicationDatabase that contains reduplication patterns of the world's languagestypology, reduplication, syntax, morphology, morphosyntaxmultilingualHurch, B. (2005-). Graz Database on Reduplication. http://reduplication.uni-graz.at/
http://www.personal.uni-jena.de/~mu65qev/tdir/Typological Database of Intensifiers and ReflexivesDatabase that contains intensifiers and reflexive patterns of the world's languagestypology, intensifier, reflexives, syntaxmultilingualGast, V., D. Hole, E. König, P. Siemund, S. Töpper (2007). Typological Database of Intensifiers and Reflexives. Version 2.0. http://www.tdir.org.
http://web.phonetik.uni-frankfurt.de/upsid.htmlUCLA Phonological Segment Inventory Data (UPSID)Contains phonological inventories for 451 languagesphonology, phoneme inventory, typologymultilingualMaddieson, I. (1984). Patterns of sounds. Cambridge studies in speech science and communication. Cambridge: Cambridge University Press.
http://apics-online.info/Atlas of Pidgin and Creole Language Structures (APiCS)Grammatical and lexical structures of 75 pidgin and creole languagesphonology, lexicon, negation, syntax, morphology, morphosyntax, typology, pidgin & creole languages, word ordermultilingualMichaelis, S. M., Maurer, P., Haspelmath, M., & Huber, M. (eds.) (2013). Atlas of Pidgin and Creole Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://valpal.info/Valency Patterns Leipzig Online Database (ValPal)Valency patterns of 36 languagestypology, syntaxmultilingualHartmann, I., Haspelmath, M., & Taylor, B. (Eds.) (2013). Valency Patterns Leipzig. Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://ewave-atlas.org/Electronic World Atlas of Varieties of EnglishOver 235 linguistic features mapped for 50 varieties of EnglishEnglish dialects, phonology, lexicon, morphology, syntax, discourse, word order, tense, aspectmultilingualKortmann, B., & Lunkenheimer, K. (Eds.) (2013). The Electronic World Atlas of Varieties of English. Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://lingweb.eva.mpg.de/numeral/Numeral Systems of the World's LanguagesData on numeral systems for about 4000 languages of the worldtypology, numeral systemsmultilingual
http://afbo.info/A world-wide survey of affix borrowingA database of 101 languages where affixes have been borrowed (total of 657 affixes)typology, affixes, morphology, morphosyntaxmultilingualSeifart, F. (2013). AfBO: A world-wide survey of affix borrowing. Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://sails.clld.org/South American Indigenous Language Structures (SAILS)A database of 604 linguistic features from 167 American Indigenous languagestypology, syntax, morphology, morphosyntax, phonology, tense, aspect, evidentiality, word order, agreementmultilingualMuysken, Pieter, Harald Hammarström, Olga Krasnoukhova, Neele Müller, Joshua Birchall, Simon van de Kerke, Loretta O'Connor, Swintha Danielsen, Rik van Gijn & George Saad. 2014. South American Indigenous Language Structures (SAILS) Online. Leipzig: Online Publication of the Max Planck Institute for Evolutionary Anthropology. (Available at http://sails.clld.org)
http://www.homophone.com/Homophone.comInformal list of English homophoneshomophones, lexiconmultilingual
http://intelligencesquaredus.org/Intelligence SquaredPolitical debates with transcripts and votes by audience memberspolitics, debate, argument, corpusEnglish
http://link.springer.com/article/10.3758/s13428-014-0552-1Nencki Affective Word List (NAWL) for PolishEmotional valence, arousal and imageability ratings for 2,902 Polish wordsemotion, valence, arousal, positive, negative, imageability, word frequency, part of speech (POS), word lengthPolishRiegel, M., Wierzba, M., Wypych, M., Żurawski, Ł., Jednoróg, K., Grabowska, A., & Marchewka, A. (2015). Nencki Affective Word List (NAWL): the cultural adaptation of the Berlin Affective Word List–Reloaded (BAWL-R) for Polish. Behavior research methods, 1-15.
http://link.springer.com/article/10.3758/s13428-011-0059-yDiscrete emotion norms for German (DENN-BAWL)Discrete emotion ratings for about 2000 German nounsemotion, valence, arousal, positive, negativeGermanBriesemeister, B. B., Kuchinke, L., & Jacobs, A. M. (2011). Discrete emotion norms for nouns: Berlin affective word list (DENN–BAWL). Behavior research methods, 43(2), 441-448.
http://www.lehoa.macmate.me/MelissaVo/BAWL-R.htmlBerlin Affective Word List Reloaded (BAWL-R)Emotional arousal and valence ratings for about 2900 German nounsemotion, valence, arousal, positive, negativeGermanVõ, M. L., Conrad, M., Kuchinke, L., Urton, K., Hofmann, M. J., & Jacobs, A. M. (2009). The Berlin affective word list reloaded (BAWL-R). Behavior research methods, 41(2), 534-538.
http://talkbank.org/SLA/Second Language Acquisition ResourcesContains transcribed corpora with audio from several languages relevant for second language acquisition researchsecond language acquisition (SLA), L2, bilingualismmultilingual
http://childes.psy.cmu.edu/phon/PhonBank Database for Phonological DevelopmentContains corpora and phonological information on child language developmentlanguage development, language acquisition, phonetics, phonology, clinical corporaEnglish, French, Portuguese, German, Swedish, Dutch, Indonesian, Japanese, Taiwanese, Cantonese, Greek, Arabic, Berber, Romanian, Polish
http://childes.psy.cmu.edu/Child Language Data Exchange System (CHILDES)Transcribed and annotated child language corpora for several languageslanguage development, language acquisition, annotated corpus, corpora, clinical corporaCeltic, Irish, Welsh, Cantonese, Chinese, Indonesian, Japanese, Korean, Taiwanese, Thai, English, Afrikaans, Dutch, Danish, German, Icelandic, Norwegian, Swedish, Catalan, Spanish, French, Italian, Portuguese, Romanian, Croatian, Polish, Russian, Serbian, Slovenian
http://childfreq.sumsar.net/ChildFreq: CHILDES frequency toolOnline access tool to CHILDES word frequency datalanguage development, language acquisition, word frequencyEnglishBaath, R. (2014). ChildFreq: An online tool to explore word frequencies in child language.
http://www.opensubtitles.org/en/searchOpen SubtitlesOver three million subtitle files for data from several languagessubtitles, corpus, parallel corpusmultilingual
http://corpus.quran.com/Quranic Arabic CorpusMorphological annotation, syntactic treebank and semantic ontology for the entire Holy QuranQuran, corpus, treebank, morphological annotation, syntax, semantics, ontologyArabic
http://www.alc.manchester.ac.uk/subjects/lel/research/projects/archer/ARCHER: A Representative Corpus of Historical English RegistersA multi-genre English corpus ranging from 1600 to 1999language history, historical corpus, registersEnglish
https://perswww.kuleuven.be/~u0044428/clmet3_0.htmCLMET: Corpus of Late Modern English Texts34 million words of running text from 1710 to 1920language history, historical corpus, registersEnglishDiller, H., De Smet, H., Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger 19, 21-35.
http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/Helsinki Corpus of English TextsContains 1.5 million words from English texts ranging from 730 AD to 1710 ADOld English, Middle English, Early Modern English, language history, historical corpusEnglish
http://www.helsinki.fi/varieng/CoRD/corpora/HCOS/index.htmlHelsinki Corpus of Older Scots (HCOS)Contains 0.8 million words of Scottish English from 1450 AD to 1700 ADScottish, language history, historical corpusEnglishThe Helsinki Corpus of Older Scots (1995). Department of Modern Languages, University of Helsinki. Compiled by Anneli Meurman-Solin.
http://www-users.york.ac.uk/~sp20/corpus.htmlBrooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old EnglishSyntactically annotated and POS-tagged corpus of Old EnglishOld English, language history, historical corpus, syntaxEnglish
http://www-users.york.ac.uk/~lang22/YcoeHome1.htmYork-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE)A corpus of 1.5 million words of Old English texts, syntactically annotated and POS-taggedOld English, prose, language history, historical corpus, syntaxEnglish
http://www-users.york.ac.uk/~lang18/pcorpus.htmlYork-Helsinki Parsed Corpus of Old English PoetryCorpus of Old English poetry, syntactically annotated and POS-taggedOld English, poetry, language history, historical corpus, syntaxEnglish
http://www.ling.upenn.edu/hist-corpora/Penn Corpora of Historical English (PPCME2, PPCEME, PPCMBE)Middle English, Early Modern English and Modern English corpora, syntactically annotated and POS-taggedMiddle English, Early Modern English, language history, historical corpus, syntaxEnglish
http://ota.ox.ac.uk/The University of Oxford Text Archive (OTA)Text archives (with some audio and video data) for lots of English texts from many different time periodshistorical corpus, text archive, corpora, language historyEnglish
http://www.helsinki.fi/varieng/domains/CEEC.htmlCorpus of Early English Correspondence (CEEC)Compiled with historical sociolinguistics in mind, a more than 6 million word corpus of English correspondence (1410-1800) from thousands of writershistorical corpus, letters, language history, correspondence, sociolinguisticsEnglish
https://www.tu-chemnitz.de/phil/english/sections/linguist/real/independent/lampeter/lamphome.htmThe Lampeter Corpus of Early Modern English TextsEnglish texts from 1640 to 1740 within the categories religion, politics, economy, science, law and miscellaneoushistorical corpus, language history, Early Modern EnglishEnglish
http://www.comp.leeds.ac.uk/eric/latifa/research.htmCorpus of Contemporary Arabic (CCA)A corpus of 0.8 million Arabic wordscorpusArabic
http://www.thelatinlibrary.com/The Latin LibraryA collection of Latin texts from several authorscorpusLatin
http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/nowac/index.htmlNorwegian Web as Corpus (NoWaC)A web-based corpus of 700 million Norwegian wordscorpusNorwegianGuevara, Emiliano Raul (2010). NoWaC: a large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop, Association for Computational Linguistics, 1 - 7.
http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.htmlTIGER CorpusGerman news corpus of 0.9 million tokens from the Frankfurter Rundschaucorpus, news, mediaGermanBrants, S., Dipper, S., Eisenberg, P., Hansen, S., König, E., Lezius, W., Rohrer, C., Smith, G., & Uszkoreit, H. (2004). TIGER: Linguistic Interpretation of a German Corpus. Journal of Language and Computation, 2004 (2), 597-620.
http://www.moderna.uu.se/slaviska/ryska/corpus/Uppsala Russian CorpusRussian corpus of different genres with transliterationscorpusRussian
http://www.arabiclearnercorpus.com/Arabic Learner Corpus (ALC)0.2 million Arabic words produced by 942 students from 66 different L1 backgroundslearner corpus, second language acquisition (SLA), bilingualismArabic
http://www.unicaen.fr/gazette/La Gazette de RenaudotHistorical corpus of French gazettes/newspapershistorical corpus, language historyFrench
http://www.engelska.uu.se/Research/English_Language/Research_Areas/Electronic_Resource_Projects/USE-Corpus/?languageId=1Uppsala Student English Corpus (USE)Corpus of 1500 essays written by 440 Swedish university studentslearner corpus, second language acquisition (SLA), bilingualismSwedish
http://www.engelska.uu.se/Research/English_Language/Research_Areas/Electronic_Resource_Projects/A_Corpus_of_English_Dialogues/Corpus of English Dialogues (CED)Corpus of 1.1 million words of English dialogues (spoken interactions) from 1560-1760historical corpus, dialogue, language history, corpora, Early Modern EnglishEnglish
http://www.gutenberg.org/Project GutenbergA collection of 50000 free ebooksbook collection, text archiveEnglish
http://www.nytimes.com/ref/membercenter/nytarchive.htmlNew York Times Article ArchiveA collection of all New York Times articles from 1851 to the presenttext archive, news, media, newspaperEnglish
http://link.springer.com/article/10.3758/BF03195349Bird Age of Acquisition and Imageability ratingsImageability and age of acquisition norms for a set of 2645 English wordsage of acquisition (AOA), imageabilityEnglish
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.htmlOpinion LexiconA list of 6800 positive and negative English opinion wordsopinion mining, sentiment analysis, emotional valence, positive, negativeEnglish
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.htmlAmazon Product Review DataMore than 5.8 million reviews of Amazon productscorpus, media, Amazon, reviewsEnglish
http://www.yelp.com/dataset_challengeYelp Challenge Review DataAbout 1.6 million reviews from 360000 Yelp userscorpus, media, Yelp, reviewsEnglish
http://sentiwordnet.isti.cnr.it/SentiWordNetA lexical resource for opinion miningemotional valence, affect, opinion mining, sentiment analysis, wordnet, positive, negativeEnglish
http://dialect.topography.chass.utoronto.ca/dt_atlas.phpAtlas of Dialect TopographyCross-regional dialect topography, largely focused on Canadasociolinguistics, dialects, Canada, Canadian English, regional variantsEnglish
http://austlang.aiatsis.gov.au/disclaimer.phpAUSTLANG: Australian Indigenous Languages DatabaseClassification and language information on Australian languages, including mapsAustralian languages, classification, geography, map, speaker information, language usemultilingual
http://www.baydat.uni-wuerzburg.de:8080/cocoon/baydat/baydatBayDat: Bayrische DialektdatenbankDatabase of Bavarian German dialectsdialects, Bavaria, Germany, sociolinguistics, mapGerman
http://lacito.vjf.cnrs.fr/pangloss/La Collection PanglossDatabase of audio materials from several of the world's languagesaudio recordings, typology, world's languagesmultilingual
http://corpus.byu.edu/time/Time Magazine Corpus100 million word corpus of TIME magazinecorpus, media, news, magazineEnglish
http://corpus.byu.edu/wiki/Wikipedia corpus1.9 billion word corpus from Wikipedia (4.4 million articles)corpus, WikipediaEnglish
http://corpus.byu.edu/glowbe/Corpus of Global Web-Based English (GloWbE)1.9 billion word corpus from 1.8 million web pagescorpus, web language, web corpusEnglish
http://corpus.byu.edu/can/Corpus of Canadian English (STRATHY)50 million word corpus of Canadian English ranging from 1920 to 2000corpus, Canada, historical corpus, language historyEnglish
http://www.corpusdelespanol.org/CORPUS DEL ESPAÑOL100 million word corpus from 20000 Spanish texts spanning a time range from 1200 to the 1900scorpus, language history, historical corporaSpanish
http://www.corpusdoportugues.org/O CORPUS DO PORTUGUES45 million word corpus of Portuguese spanning from 1300 to 1900corpus, language history, historical corporaPortuguese
http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.phpDigital Corpus of Sanskrit (DCS)3.2 million words of Sanskrit with collocatescorpus, language history, historical corporaSanskrit
http://xtone.linguistics.berkeley.edu/about.phpCross-Linguistic Tonal Database (Xtone)Information on lexical tone from 82 different languagestypology, phonology, lexical tone, tonal systemsmultilingual
http://tls.uni-hd.de/home_en.lassoThesaurus Linguae SericaeA historical and comparative encyclopedia of Chinese conceptual schemes, with corpora and semantic relationslanguage history, historical corpus, semantic relations, encyclopedia, historical phonologyChinese
http://sealang.net/assam/Tai and Tibeto-Burman language corporaTranscribed and translated texts of Tibeto-Burman languagesTibeto-Burman, corpus, corpora, endangered languagesAhom, Aiton, Khamti, Khamyang, Singpho, Turung, Tangsa
http://turing.iis.sinica.edu.tw/treesearch/Sinica TreebankParsed corpus of 360000 Chinese wordstreebank, corpus, syntaxChinese
http://romani.humanities.manchester.ac.uk/rms/Romani Morpho-Syntax Database (RMS)Database of linguistic features of Romanigrammatical database, syntax, morphologyRomani
http://www.gaois.ie/crp/en/Parallel English-Irish corpus of legal texts4.5 English words of legal texts with Irish translationsparallel corpus, corpus, translation, law, legal textsEnglish, Irish
http://www.uni-stuttgart.de/lingrom/stein/corpus/Le Nouveau Corpus d'AmsterdamOld French literary texts between 11th and 14th centurylanguage history, historical corpus, Old FrenchFrench
http://www.meertens.knaw.nl/nfb/Nederlandse Familienamenbank300000 Dutch names and their locations in the Netherlandsonomastics, names, geographyDutch
http://www.livac.org/LIVAC Synchronous Corpus550 million word Chinese corpuscorpusChinese
http://www.cfilt.iitb.ac.in/indowordnet/IndoWordNetA wordnet of the languages of Indiawordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchiesHindi, Assamese, Bengali, Bodo, Gujarati, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, Telugu, Urdu
http://www.cfilt.iitb.ac.in/wordnet/webhwn/Hindi WordNetA wordnet of Hindiwordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchiesHindi
http://hebrewcorpus.nmelrc.org/HebrewCorpusA 150 million word corpus of HebrewcorpusHebrew
http://www.sfs.uni-tuebingen.de/GermaNet/GermaNetA wordnet of Germanwordnet, lexical database, dictionary, vocabulary, semantic relations, semantics, semantic hierarchiesGerman
http://argyf.fryske-akademy.eu/files/tdb/Frisian Languages DatabaseFrisian database containing audio and written corpora, including historical oneshistorical corpus, spoken corpus, audio, language history, Germanic languagesFrisian
http://eap.bl.uk/Endangered Archives ProgrammeA text archive of endangered languagestext archive, endangered languagesmultilingual
http://www.panlex.org/PanLexVocabulary and translations for 21 million expressions in about 10,000 language varieties, including Swadesh lists for about 2000 language varietiesvocabulary, word list, dictionary, Swadesh listmultilingual
http://www.lmp.ucla.edu/UCLA Language Materials ProjectContains teaching and learning materials for over 150 less commonly taught languages, including speaker and other information about the languagesspeaker information, bilingualism, language use, demographic datamultilingual
http://buckeyecorpus.osu.edu/The Buckeye Speech CorpusCorpus of high-quality recordings from 40 speakers in Columbus, Ohio, orthographically transcribed and phonetically labelledaudio corpus, annotated corpus, phonetics, phonology, speechEnglish
http://groups.inf.ed.ac.uk/switchboard/index.htmlThe Switchboard Corpus in NXTUpdated annotations of the Switchboard corpus of telephone conversations, annotatedannotated corpus, prosody, syntax, speech, conversational speech, telephone conversationEnglish
https://catalog.ldc.upenn.edu/LDC96L14CELEX2 CorpusCorpus of English, Dutch and German with additional lexical informationcorpus, word frequencyEnglish, Dutch, German
https://catalog.ldc.upenn.edu/LDC93S1TIMIT Acoustic-Phonetic Continuous Speech CorpusAudio corpus of 630 speakers of eight American English dialects with time-aligned orthographic, phonetic, and word transcriptionsannotated corpus, speech, phonetics, phonology, audio corpus, English dialectsEnglish
http://demeter.inf.ed.ac.uk/cross/publications.htmlTwitter FSD First Story Detection CorpusCorpus of "first stories" (new events) from Twittercorpus, web language, social media, web corpusEnglish
http://clic.cimec.unitn.it/amac/twitter_ngram/Rovereto Twitter N-Gram CorpusN-grams (up to 6-grams!) for 75 million English tweetscorpus, ngrams, word frequency, web language, social mediaEnglish
http://trec.nist.gov/data/tweets/Tweets2011 corpusA corpus of tweets collected in January and February 2011corpus, web languageEnglish
http://demeter.inf.ed.ac.uk/cross/publications.htmlNewswire FSD First Story Detection CorpusA corpus of "first stories" (new events) from newswirecorpus, web language, newspaper, media, political languageEnglish
http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=RepubblicaLa Repubblica CorpusA corpus of 380 million tokens of Italian newspaper texts, POS-tagged, lemmatized and genre categorizedcorpus, genre, topic, syntax, part of speech (POS), newspaper, media, political languageItalian
http://www.nzilbb.canterbury.ac.nz/onze.shtmlOrigins of New Zealand English (ONZE) CorpusA corpus of various stages of New Zealand Englishaudio corpus, phonetics, phonology, language history, historical corpusEnglish
http://laslab.org/resources/confusions/Corpus of noise-induced Spanish misperceptions/confusionsA corpus of 3235 noise-induced robust misperceptions in Spanishcorpus, phonetics, phonology, speech perceptionSpanish
http://www.cs.cmu.edu/~mfaruqui/suite.htmlWordSim353 evaluation benchmarksHuman similarity ratings for over 3000 word pairs, including syntactic relationssemantics, similarity, semantic relatednessEnglish, German, French, Arabic, Romanian, Spanish
https://github.com/mfaruqui/non-distributionalNon-distributional English word vectorsLarge lexicon with thesaurus, antonyms, color, connotations and valence information extracted through NLP proceduressemantics, lexicon, sentiment analysis, affect, emotional valence, antonymsEnglish
https://console.developers.google.com/storage/browser/wikipedia_multilingual_relations_v1/Semantic Relations from WikipediaA dataset of automatically extracted semantic relations from the multilingual Wikipedia corpussemantics, semantic relationsFrench, Russian, Chinese, Arabic, Hindi, Indonesian, Tagalog, Latvian, Swahili, Georgian
http://www.nlpado.de/~sebastian/data/tv_data.shtmlBilingual Formal/Informal Address CorpusCorpus of English and German sentences from novels tagged for formal and informal connotations, tokenized, lemmatized, POS-taggedannotated corpus, politeness, formal languageGerman, English
http://www.coli.uni-saarland.de/projects/salsa/corpus/German SALSA CorpusA large frame-based lexicon for German with semantic rolessemantic roles, frames, framenetGerman
http://www.nlpado.de/~sebastian/data/srl_data.shtmlCross-lingual projection of semantic rolesParallel corpora annotated for semantic rolesparallel corpus, corpus, translation, semantic rolesGerman, English
https://framenet.icsi.berkeley.edu/fndrupal/FrameNetA lexical database of English that specifies semantic frames and semantic roles, more than 10000 sensesframenet, lexical database, dictionary, vocabulary, semantic relationsEnglish
http://u.cs.biu.ac.il/~nlp/resources/downloads/annotation-of-discourse-references-relevant-for-entailment-inference/Discourse Reference CorpusPragmatically annotated corpus with information about coreference and bridgingreference, discourse, pragmatics, annotated corpus, entailment inference, coreferenceEnglish
http://clic.cimec.unitn.it/dm/Distributional Memory semantic databaseSemantic database of English based on distributional informationlexicon, semantic relatedness, relations, corpus-based semantics, co-occurrenceEnglish
http://www.cl.uni-heidelberg.de/~zeller/res/te-ger/index.mhtmlTextual Entailment Search Task Dataset for GermanA corpus of 3000 text/hypothesis pairs derived from web forum poststextual entailment, semantic inference, pragmatics, corpus, web languageEnglish
http://www.ims.uni-stuttgart.de/permalink/56cc6c89-c421-11e4-a5e6-000e0c3db68b.htmlDErivBase German Derivational LexiconA derivational lexicon for Germanmorphology, lexicon, dictionary, lemmaGerman
http://takelab.fer.hr/data/dmhr/Distributional Memory for CroatianSemantic database of Croatian based on distributional informationlexicon, semantic relatedness, relations, corpus-based semanticsCroatian
http://www.cl.cam.ac.uk/~fh295/simlex.htmlSimLex999 Semantic Relatedness DatasetA dataset of normed semantic similarity (rather than just word associations)semantics, semantic similarity, semantic relatedness, relations, concreteness, word associationEnglish
http://www.kuleuven.be/semlab/interface/index.phpDutch Word AssociationsA dataset of word associations in Dutchsemantics, word associationDutch
http://www.nltk.org/nltk_data/NLTK CorporaVariety of corpora and datasets built into the NLTK Python librarynatural language processing, python, brown, australian broadcasting, alpino dutch treebank, treebank, CONLL, Europarl, Genesis, bible, gazetteer, C-Span, Gutenberg, KNB corpus, sentiment, NPS chat, opinion lexicon, multilingual wordnet, penn treebank, sentiwordnetEnglish, Portuguese, Spanish, Basque, Old English, Mandarin Chinese, Polish, Brazilian Portuguese
http://www.cstr.ed.ac.uk/research/projects/artic/accor.htmlEUR-ACCORCross-language recordings with EPG, laryngograph, nasal airflow, and audioarticulatory phonetics, articulation, speech production, Rothenberg maskCatalan, English, French, German, Irish Gaelic, Italian, Swedish
http://www.cstr.ed.ac.uk/research/projects/artic/mocha.htmlMOCHA-TIMITDataset with audio, laryngograph and EMA recordings for English, constructed with the intention of training an automatic speech recognition systemarticulation, articulatory phonetics, speech production, electromagnetic articulography, tongue recordingEnglish
http://www.u.arizona.edu/~nwarner/WarnerMcQueenCutler.htmlEnglish Diphone Perceptual DatabasePhoneme categorizations based on a gated listening taskspeech perception, phonetics, phonology, psycholinguistics, phonetic information over timeEnglish
http://www.mpi.nl/world/dcsp/diphones/Dutch Diphone Perceptual DatabaseA total of 488,520 phoneme categorizations based on a gated listening task of 1,179 Dutch diphonesspeech perception, phonetics, phonology, psycholinguistics, phonetic information over timeDutch
http://www.linguateca.pt/acesso/corpus.php?corpus=SAOCARLOSNILC / San Carlos CorporaCollection of corpora of contemporary Portuguese, with part of speech tags (POS-tagged)corpus, annotated corpusPortuguese
http://www.clul.ul.pt/pt/recursos/183-reference-corpus-of-contemporary-portuguese-crpcCRPC Comparative Portuguese corpusLarge corpus containing texts from several varieties of Portuguese (European, Brazil, Angola, Cape Verde, Guinea-Bissau, Mozambique, Sao Tome and Principe, Goa, Macau, Timor-Leste)corpus, dialectal corpus, sociolinguisticsPortuguese
http://cipm.fcsh.unl.pt/CIPM Medieval Portuguese corpusHistorical corpus of medieval Portuguesehistorical corpus, language history, classical & medieval PortuguesePortuguese
http://www.letras.ufrj.br/nurc-rj/NURC-RJ Spoken Portuguese CorpusSpoken corpus of Brazilian Portuguesespoken corpus, audio recordings, phonetics, phonologyPortuguese
http://www.letras.ufrj.br/laborhistorico/LaborHistorico Historical Portuguese corpusOfficial historical corpus of the "A history of Brazilian Portuguese" projecthistorical corpus, language historyPortuguese
https://sites.google.com/site/distributedlittleredhen/homeDistributed Little Red Hen Lab DatabasesResource directory for the UCLA NewsScape Library of International Television News; a TV News Archive that contains news programssemantics, gesture, phonetics, corpus, television (TV), media, politics, news, multimodal corpusEnglish
http://spokenchinesecorpus.nccu.edu.tw/NCCU Corpus of Spoken ChineseSpoken corpus of Chinesespoken corpus, audio recordings, phonetics, phonologyMandarin Chinese
http://andosl.anu.edu.au/andosl/Australian National Database of Spoken Language (ANDOSL)Phonetically annotated spoken language corpus of Australian Englishspoken corpus, audio recordings, phonetics, phonology, phonetically annotatedAustralian English
http://projects.ael.uni-tuebingen.de/backbone/moodle/BACKBONE Pedagogic Corpus of Video-Recorded InterviewsSpoken interviews with video recordings for several European languages, including second language recordingsspoken corpus, multimodal corpus, video, second language acquisition (SLA), bilingualismEnglish, French, German, Polish, Spanish, Turkish
http://serverdbt.ilc.cnr.it/altweb/Atlante Lessicale Toscano (ATL Lexical Atlas of Tuscany)Lexical atlas and demographic data; dialectal resource for Tuscan dialects in Italysociolinguistics, dialects, lexical atlas, language geography, dialectology, Italian dialectsItalian
https://catalog.ldc.upenn.edu/LDC2009T25Web 1T 5-gram ngrams for 10 European languagesN-grams (up to 5-grams) and frequency counts for 10 European languagesn-grams, word frequency, GoogleSwedish, Spanish, Romanian, Portuguese, Polish, Dutch, Italian, French, German, Czech
http://www.let.rug.nl/gosse/bin/Web1T5_freq.perlWeb 1T 5-gram database for DutchN-grams and frequency counts for Dutchn-grams, word frequency, GoogleDutch
http://rugtest16.service.rug.nl/gosse/Ngrams/Groningen Twitter CorpusDutch twitter corpus containing approximately 2.6 billion tweets and 28 billion tokens collected between January 2014 and December 2014, n-gram parsed up to 5-gramsn-grams, twitter, web corpus, web language, social mediaDutch
http://www.linguistics.ucsb.edu/research/santa-barbara-corpusSanta Barbara Corpus of Spoken American English (SBCSAE)249,000 words with transcriptions, audio and timestampsspoken corpus, audio recordings, phonetics, phonology, phonetically annotatedEnglish
http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/IMS-GECO.en.htmlIMS GECO Phonetic Convergence database46 dialogs (ca. 25 min long) between female German speakers, in speaker-visible and speaker-invisible contexts for the study of phonetic convergencespoken corpus, audio recordings, phonetics, phonology, multimodal corpus, phonetic convergence, accommodation, interpersonal synchrony, sociolinguistics, sociophoneticsGerman
http://quod.lib.umich.edu/cgi/c/corpus/corpus?c=micase;page=mbrowseMICASE Michigan Corpus of Academic Spoken English152 transcripts totaling 1.8 million words of academic spoken Englishspoken corpus, university language, registers, formal languageEnglish
http://u.cs.biu.ac.il/~koppel/BlogCorpus.htmBlog Authorship corpus681,288 posts totaling 140 million words from 19,320 bloggers, collected in 2004, balanced for gender; with age, gender and industry/occupation informationcorpus, web language, social media, web corpus, demographic data, sociolinguisticsEnglish
https://www.ausnc.org.au/AusNC Australian National CorpusCollection of Australian English corpora (including ACE, ART, AusLit, Braided Channels, COOEE, GCSAusE, ICE Corpus, MD Corpus, Monash Corpus); includes many registers and different time periods and transcribed speech from sociolinguistic interviews with gender informationcorpus, Australian English, dialects, spoken corpus, written language, literature, poetry, historical corpus, language history, varieties of EnglishEnglish
http://taiccm.org/Taiwan Corpus of Child Mandarin (TCCM)Taiwan Corpus of Child Mandarin (TCCM)corpus, child language, language acquisition, L1, children, learner corpusChinese
http://link.springer.com/article/10.3758/BF03193116Wurm 2007 Danger and Usefulness NormsA published research article that includes ratings for the danger and usefulness of wordssemantics, danger, usefulness, semantic norms, meaning, perceptual attributesEnglish
http://link.springer.com/article/10.3758/BRM.40.1.183#page-1Semantic Feature Production NormsSemantic feature production norms for 456 words (objects and events)semantic features, semantics, feature norms, distinctive features, objects and events, propertiesEnglish
http://www.neuro.mcw.edu/ratings/Wisconsin Perceptual Attribute Ratings Database (MCWisc)Perceptual attribute norms for four sensory domains: sound, color, manipulation, motion; for 1402 words, including emotion ratings reflecting intensity and valenceperceptual attributes, manipulability, semantics, concepts, perception, valence, affect, feelingEnglish
http://link.springer.com/article/10.3758/BF03195584Extension of Paivio normsExtension of Paivio et al. (1968) lexical normsgender ladenness, sexual language, stereotypes, age of acquisition (AOA), number of meanings, number of associates, emotionality, pleasantness, emotional valence, children's dictionaries, concreteness, meaningfulness, goodness, word frequency, imagery, imageability, language acquisition, children's word knowledge, lexical knowledge, word knowledgeEnglish
http://sumale.vjf.cnrs.fr/pronoms/Les marques personnelles dans les langues africainesDatabase of personal pronouns of African languagestypology, morphosyntax, morphology, syntax, personal pronouns, person markingmultilingual
http://typo.uni-konstanz.de/archive/intro/The Konstanz Universals ArchiveA list of proposed typological universalslanguage typology, Greenbergian universals, morphology, syntax, morphosyntax, word ordermultilingual
http://typo.uni-konstanz.de/rara/intro/index.phpDas grammatische RaritätenkabinettInformal list of grammatical rarities / typologically rare featurestypology, universals, rare features, syntax, morphosyntaxmultilingual
http://www.soundcomparisons.comSound comparisonsComparative atlas and map with audio samples of Germanic, Romance, Slavic, Celtic, Andean and Mapudungunlanguage geography, comparative linguistics, dialectology, Indo-European languages, pronunciation, sound patterns, phonetics, phonology, cognates, cognacymultilingual, including English, German, French, Italian, Spanish and Portuguese
http://langscape.umd.edu/LangscapeWorld map / linguistic atlas showing the location of languages and visualizing linguistic diversity across the globelanguage geography, linguistic diversity, typology, map, endangered languages, atlasmultilingual
http://sswl.railsplayground.net/SSWL Syntactic Structures of the World's LanguagesTypological database with syntactic features for 250+ languages of the worldtypology, morphology, morphosyntax, syntax, word order, universalsmultilingual
http://sealang.net/monkhmer/dictionary/SEAlang Mon-Khmer Etymological DictionaryDictionary for comparative and historical linguistics of Mon-Khmer languagesetymology, dictionary, lexical data, language history, historical linguistics, phylogenetics, comparative dictionaries, Asian languagesmultilingual
http://pollex.org.nz/Polynesian Lexicon Project (Pollex Online)Large-scale comparative dictionary of Polynesian languagesPolynesian, Austronesian, lexical data, comparative dictionary, cognacy, cognates, historical linguistics, Pacific languages, word listsmultilingual
http://transnewguinea.org/TransNewGuinea.orgDatabase of languages from the Trans-New Guinea family and friends, encompassing 900+ languages and info on 1000+ wordsPacific languages, Trans-New Guinea family (TNG), Papua New Guinea (PNG), language history, historical linguistics, linguistic diversity, Austronesianmultilingual
http://starling.rinet.ru/new100/main.htmThe Global Lexicostatistical Database (GLD)Basic word lists for many of the world's languagescomparative linguistics, historical linguistics, phylogenetics, lexicostatistics, basic vocabulary, word lists, Swadesh listmultilingual
http://www.lapsyd.ddl.ish-lyon.cnrs.fr/Lyon-Albuquerque Phonological Systems Database (LAPSyD)Searchable database of basic phonological information on a wide sample of the world's languagesphonological typology, phoneme inventory, phonology, phonetics, consonants, vowels, syllable structure, linguistic stress, lexical tonesmultilingual
https://doi.org/10.3758/s13428-018-1099-3The Glasgow NormsNormative ratings for 5,553 English words on nine psycholinguistic dimensions: arousal, valence, dominance, concreteness, imageability, familiarity, age of acquisition, semantic size, and gender associationEnglishScott, G. G., Keitel, A., Becirspahic, M., Yao, B., & Sereno, S.C. (2018). The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51, 1258–1270.
https://www.cogsci.mq.edu.au/research/resources/nwdb/nwdb.htmlARC Nonword Database358,534 nonwordsRastle, K., Harrington, J., & Coltheart, M. (2002). 358,534 nonwords: The ARC Nonword Database. Quarterly Journal of Experimental Psychology, 55A, 1339-1362.
https://lingualab.ca/en/project/norms-familiarity-perceptual-strengthFrench Canadian conceptual familiarity norms3,596 nouns and online data about them from 313 Canadian French speakersFrench
https://smallworldofwords.org/en/projectSmall World of WordsWord association and participant data for 100 primary, secondary and tertiary responses to 12,292 cues in English, 12,571 cues in DutchEnglish, Dutch