10

Suppose someone generates word frequency lists, for example by analysing a web corpus or data from keyboards. Is a list of words and their frequencies copyrightable? For example, if someone chose to publish this list online, could it be protected using copyright law?

Edit: A word-frequency list is a list of words and the number of times each has occurred in a text.
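To make the definition concrete, here is a minimal sketch of how such a list might be produced. The function name `word_frequencies`, the sample sentence, and the simple tokenising rule are assumptions chosen for illustration; real corpus tools tokenise far more carefully.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, count) pairs, most frequent first.

    Assumption: a "word" is any run of ASCII letters or apostrophes,
    compared case-insensitively. Real tokenisers are more elaborate.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

# Illustrative sample text, not taken from any real corpus.
sample = "the cat sat on the mat and the dog sat too"
freq = word_frequencies(sample)

for word, count in freq:
    print(word, count)
```

Anyone running the same counting rule over the same text gets the same list, which is why the answers below treat the output as a factual report rather than a creative work.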

Edit2: For those who are interested, the jurisdiction is India, although I won't mind general info from around the world, as it will be helpful to all.

5 Answers

12

The underlying text(s) may be subject to copyright protection, but the individual words are not -- they are usually independently existing words of the language (there are invented words like "chrowl", which I have never seen appearing in a word frequency list). A frequency count is simply a factual report about language use in a corpus. Even in the EU, which grants some protection to databases of facts, the rigorous objectivity of word frequency lists precludes copyright protection; see the Database Directive, Art. 3:

  1. In accordance with this Directive, databases which, by reason of the selection or arrangement of their contents, constitute the author's own intellectual creation shall be protected as such by copyright. No other criteria shall be applied to determine their eligibility for that protection.

  2. The copyright protection of databases provided for by this Directive shall not extend to their contents and shall be without prejudice to any rights subsisting in those contents themselves.

What is protected in the original is the creative combination of words, which is lacking in a frequency list.

A wordlist is created purely by sweat of the brow or, more typically, by a computer program. As a database, it lacks the bare minimum of creativity necessary for copyright protection. The program that creates the frequency counts would be protected as involving some creativity, but the product of that computation is not -- it is automatic.

3
  • 1
    The OP didn't specify which country the question concerns. Therefore it should be pointed out that under en.wikipedia.org/wiki/Sweat_of_the_brow a frequency list could be subject to copyright in the UK.
    – Pere
    Commented Jul 10, 2020 at 11:17
  • Isn't that now invalidated by the ECJ's 2012 opinion? At least for the moment.
    – emrys57
    Commented Jul 10, 2020 at 15:09
  • 1
    I'm not entirely sure if this answer is quoting the right part of the Database Directive. The EU admits copyright of databases whose "selection or arrangement of their contents" are original, but it also has a separate sui generis right for other databases, which is not called "copyright" and appears to have been omitted from this answer.
    – Kevin
    Commented Jul 10, 2020 at 18:12
8

Data is not copyrightable, but databases (structured, organized data) might be. This depends on the jurisdiction, e.g. database rights are recognized in the EU. Whereas copyright protects creative expression, database rights protect the effort that went into collecting and organizing the data.

Note that even when database rights apply, this doesn't prevent someone else from performing the same analysis and coming up with an equivalent database.

Besides the rights in the database as a whole, there's also the question of copyrightability of the individual entries in the database. For example, even short text snippets of a couple of words might already be subject to copyright. However, it is probably safe to assume that individual words are uncopyrightable, perhaps with the exception of clearly creative words such as “Supercalifragilisticexpialidocious”.

Where the analysis of word frequency is based on copyrighted material, a license for processing that material may or may not be required, depending on jurisdiction. Copyright laws might have exceptions for scientific uses or non-commercial fair use, or might even consider this use entirely unproblematic in general.

2

Note that even when database rights apply, this doesn't prevent someone else from performing the same analysis and coming up with an equivalent database.

I think Anon gives good advice on this question; the only change I would make is in reference to the above.

It is in the nature of copyright (unlike patent) that independent creation cannot be an infringement of an existing work. That is to say, if I create a melody without reference to other works, and that melody is coincidentally identical to an existing work, then my melody does not infringe the other work.

There is case law on issues such as when one may be “exposed” to another work, or unconsciously copy it.

The scenario you envisage may be one of those cases where the method of creation is the element protected, rather than the creation itself. This takes copyright closer to the concepts of patent. In this case that could mean that you had copyright protection for the software written to complete the analysis, but not for the output of that analysis.

2

Yes. Copyright definitely subsists in any database established in the EU. A word frequency list is a database and has a 15-year sui generis right. This also includes derivative works.

Lightly remixing the wordlist counts as a derivative work, and falls under copyright protection.

The answer provided by "user6726" is incomplete. While frequency lists can be generated automatically, the results are usually error-prone. Raw data perhaps can't be copyrighted; I'm not sure about that.

But a cleaned corpus is copyrighted. When raw corpus data is tagged automatically, the inaccuracy is anywhere from 5% to 35%, depending on the software used. Correcting this requires manual work, and it takes many hours to establish a reliable frequency list.

Especially Slavic languages, with their many inflections, prove quite difficult.
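As a rough illustration of why inflection matters, the sketch below counts Polish forms of "kot" (cat) first as raw tokens and then after merging them under one lemma. The `LEMMAS` table and the token list are hypothetical and hand-made for this example; they stand in for the manual correction work described above and are not the output of any real tagger.

```python
from collections import Counter

# Hypothetical hand-curated table mapping inflected forms to a lemma.
# In practice such a table is the product of manual cleaning work.
LEMMAS = {"kota": "kot", "kotem": "kot", "koty": "kot"}

def lemma_frequencies(tokens, lemmas):
    """Count tokens after replacing known inflected forms by their lemma."""
    return Counter(lemmas.get(token, token) for token in tokens)

# Illustrative token stream, not taken from a real corpus.
tokens = ["kot", "kota", "kotem", "koty", "pies"]

raw = Counter(tokens)                       # every surface form counted separately
merged = lemma_frequencies(tokens, LEMMAS)  # inflected forms folded into "kot"

print(raw["kot"], merged["kot"])
```

The raw count treats each surface form as a different word, while the merged count depends entirely on the quality of the lemma table.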

I can't comment on posts, but another user said something about using the same source text to establish a frequency list. It's very unlikely that the same source is used more than once, and the frequency lists you get vary greatly depending on the source text used.

It is very easy to spot when someone used a frequency list that they did not establish themselves if you have some experience building corpora/frequency lists.

Please see here for more information:

https://europa.eu/youreurope/business/running-business/intellectual-property/database-protection/index_en.htm

https://en.wikipedia.org/wiki/Database_right

Source: I'm a publisher specializing in this kind of thing.

-1

I would argue 'no', on the basis that a word frequency list is not a creative work. Anyone can generate that list at any time, and not even necessarily in the same way. It's more like the publication of an empirical study, which by definition is supposed to be testable: others repeat the experiment, including the methodology, and produce largely similar results. Research studies are... I don't know if copyrightable is the right term, but certainly you're obliged to give credit where credit is due. The data itself, however, cannot be copyrighted. Patent law wouldn't apply either, because the list isn't substantively unique, contributory to innovation, or marketable; there isn't even a tangible component to a word list. All those same words are already listed in something called a dictionary, which predates such a study. The only true difference is the order in which they are written, which is based exclusively on measurable quantities that exist in the world, through counting that anyone can perform and that has no real creative element.
