8
$\begingroup$

I'm trying to generate random textual data based on regular expressions. I'd like to be able to do this in R, as I know that R does have regex capabilities. Any leads?

This question has come up before in forums (StackOverflow Post 1, StackOverflow Post 2, etc.), but the answers always mention solutions based on other programming platforms (Perl, .NET, ...), not R.

$\endgroup$
3
  • 3
    $\begingroup$ I like the Xeger solution linked to in the first post: it obtains the deterministic FSA created by the system and then makes random transitions within it. (The snarky negative comments in the second post seem altogether ignorant of that simple, valid approach). AFAIK, nothing in R provides the equivalent capability. Consider using something like Xeger to generate a text file of strings and then read it into R for the intended statistical processing. $\endgroup$
    – whuber
    Commented Mar 4, 2011 at 22:59
  • 1
    $\begingroup$ Or make a "RXeger" wrapper package with rJava and then post it on CRAN ;-) $\endgroup$
    – user88
    Commented Mar 4, 2011 at 23:45
  • $\begingroup$ Could you explain how a regular expression--which typically describes an infinite class of strings--determines a probability distribution over that class? It seems you need much more information in order to determine the distribution. $\endgroup$
    – whuber
    Commented Jun 8, 2018 at 12:50

2 Answers

7
$\begingroup$

While generating random data from regular expressions would be a convenient interface, it is not directly supported in R. You could try one level of indirection though: generate random numbers and convert them into strings. For example, to convert a number into a character, you could use the following:

> rawToChar(as.raw(65))
[1] "A"

By carefully selecting the range of random numbers to draw from, you can restrict yourself to a desired set of ASCII characters that corresponds to a regular expression, e.g., to the character class [a-zA-Z].

Clearly, this is neither elegant nor efficient, but it is at least native and could give you the desired effect with some boilerplate.
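
As a minimal sketch of that idea (the helper name random_alpha and its arguments are made up for illustration, not part of any package), one could sample ASCII codes for [a-zA-Z] and convert them with rawToChar:

```r
# Hypothetical helper: draw n random strings of `len` characters from [a-zA-Z].
random_alpha <- function(n = 1, len = 8) {
  # ASCII codes 65-90 are A-Z, 97-122 are a-z
  codes <- c(65:90, 97:122)
  vapply(seq_len(n), function(i) {
    # sample codes with replacement, then turn the raw bytes into one string
    rawToChar(as.raw(sample(codes, len, replace = TRUE)))
  }, character(1))
}

x <- random_alpha(3, 5)
# every generated string should match the target character class
all(grepl("^[a-zA-Z]{5}$", x))
```

The uniform draw over the 52 codes is itself an assumption; a regular expression alone does not say how likely each matching string should be.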

$\endgroup$
2
  • $\begingroup$ Thanks, @Matthias Vallentin. I've just used this for generation of a random length word based on a simple regular expression. I had been unable to do the same in Excel earlier. However, it would be nice to have a more robust approach in the future! $\endgroup$
    – drapkin11
    Commented Mar 5, 2011 at 1:28
  • 1
    $\begingroup$ (+1) For the function rawToChar which I did not know. $\endgroup$
    – gui11aume
    Commented Jun 17, 2012 at 16:05
1
$\begingroup$

This is still not a perfect answer, but Mark Heckmann has suggested a random string generator that partially solves the problem:

GenRandomString <- function(n = 1, length = 12)
{
  randomString <- character(n)            # initialize result vector
  for (i in seq_len(n))
  {
    # draw `length` characters from digits and letters, with replacement
    randomString[i] <- paste(sample(c(0:9, letters, LETTERS),
                                    length, replace = TRUE),
                             collapse = "")
  }
  return(randomString)
}
GenRandomString(5, 8)

Output: five random strings, 8 characters long

[1] "l42DjAtc" "jW6TdRZw" "5aAvMuDL" "iC3xOvst" "gqgSzE83"

This can be used in various cases, e.g., to generate keys, names, or data for simulations.
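
The same sampling approach might be stretched to mimic a simple fixed-length pattern by giving each position its own character set. The sketch below assumes a hypothetical helper, gen_from_classes, and handles only a sequence of character classes, not general regular expressions:

```r
# Hypothetical helper: `classes` is a list with one character vector per
# position, holding the characters allowed at that position.
gen_from_classes <- function(n, classes) {
  vapply(seq_len(n), function(i) {
    # pick one character per position, then glue the positions together
    chars <- vapply(classes, function(set) sample(set, 1), character(1))
    paste(chars, collapse = "")
  }, character(1))
}

# Mimics the pattern [A-Z]{2}[0-9]{4}: two capital letters, then four digits
classes <- c(rep(list(LETTERS), 2), rep(list(as.character(0:9)), 4))
ids <- gen_from_classes(5, classes)
all(grepl("^[A-Z]{2}[0-9]{4}$", ids))
```

Alternation, optional groups, and unbounded quantifiers such as * would need a real automaton-based generator like the Xeger approach mentioned in the comments.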

$\endgroup$
