30

I would like to define a function that splits an input text into its constituent letters, so that I can process each letter individually. (The background to this question is that I would like to put each letter into an individual box, and would like to avoid calling my boxing function by hand over and over.)

So how would I go about splitting a text into its letters, so that for each letter, I can call an appropriate subfunction?

9 Answers 9

24

The answer depends a lot on what you mean by a "character" and on what your input looks like (e.g., does it contain commands?). One possibility is to use soul, which contains a lot of code for analyzing text. For example, you can get your boxes simply by misusing the \so command:

\documentclass{article}
\usepackage{soul}

\makeatletter
\def\SOUL@soeverytoken{%
 \fbox{\the\SOUL@token}}
\makeatother
\begin{document}
\so{abcADBC kdkkk dkdk kdkdk }
\end{document}
2
  • Soul's parsing features are great. Commented Sep 3, 2010 at 16:23
  • 1
    soul is nice as long as you are dealing with 8-bit characters (i.e. use pdfTeX); but if you use Unicode you should use XeTeX or LuaTeX and forget about soul. Commented Oct 29, 2012 at 12:21
18
+50

This seems like a job for expl3. Let's say we want to split a string of characters into its constituents for later processing. So we define a macro that takes two arguments: the macro that does the processing, and the string.

\documentclass{article}

\ExplSyntaxOn

\NewDocumentCommand{\stringprocess}{ m m }
 {
  \egreg_string_process:nn { #1 } { #2 }
 }
\cs_new_protected:Npn \egreg_string_process:nn #1 #2
 {
  \text_map_inline:nn { #2 } { #1 { ##1 } }
 }
\ExplSyntaxOff

\newcommand{\boxchar}[1]{\fbox{\strut#1} } % leave a space after the box

\begin{document}

\stringprocess{\boxchar}{abcdef}

\stringprocess{\boxchar}{ábcdefß}

\end{document}


Here is a different toy problem. In this case "complex" characters are more problematic, so I'll stick with ASCII. We want to input a string and get as a result a token list that contains only the digits in the string, separated by commas. We assume that the input is controlled, so it contains only alphanumeric characters.

All we need is to define a suitable auxiliary function, instead of the simple \boxchar used before. However, it's better to use sequences instead of token lists, so I'll rework the solution from the start.

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn
\seq_new:N \l__egreg_input_string_seq
\seq_new:N \l__egreg_output_string_seq

\cs_new_protected:Npn \egreg_string_process:nnn #1 #2 #3
% #1 = preprocess macro
% #2 = postprocess macro
% #3 = string
 {
  \seq_clear:N \l__egreg_output_string_seq
  \seq_set_split:Nnn \l__egreg_input_string_seq { } { #3 }
  \seq_map_inline:Nn \l__egreg_input_string_seq
   { #1 { ##1 } }
  #2
 }

\NewDocumentCommand{\boxchars}{ m }
 {
  \egreg_boxchars:n { #1 }
 }

\cs_new_protected:Npn \egreg_boxchars:n #1
 {
  \egreg_string_process:nnn
   { \egreg_fbox_strut:n }
   { \seq_use:Nnnn \l__egreg_output_string_seq { ~ } { ~ } { ~ } }
   { #1 }
 }
\cs_new_protected:Npn \egreg_fbox_strut:n #1
 {
  \seq_put_right:Nn \l__egreg_output_string_seq { \fbox { \strut #1 } }
 }
\ExplSyntaxOff

\begin{document}
\boxchars{abcdef}
\end{document}

This would give the same result as before, but \unskip wouldn't be necessary.

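  1. The string passed as the last argument is split into its components. In \egreg_string_process:nnn above the delimiter is fixed to be empty, so the string is split into single items; in the four-argument \egreg_string_process:nnnn of the complete code below, the third argument is the delimiter (which can also be empty) and the fourth is the string. An auxiliary "output" sequence is cleared for possible subsequent use by the preprocessing or postprocessing macros;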

  2. Each element of the sequence is passed to the "preprocessing macro", which should be a one argument function;

  3. The "postprocess" macro is applied.

In the example, the preprocess macro stores each item, wrapped in \fbox{\strut ...}, in the output sequence; the postprocess macro just produces the items in the output sequence, separated by spaces.

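What about the toy problem? The preprocess macro should test whether the item is a digit and, if so, add it to the output sequence. Let's add this code before \ExplSyntaxOff; note that it calls the four-argument \egreg_string_process:nnnn, which is defined in the complete code below.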

\cs_new_protected:Npn \egreg_store_digit:n #1
 {
  \bool_if:nT
   {
    \int_compare_p:n { `#1 >= `0 } && \int_compare_p:n { `#1 <= `9 }
   }
   {
    \seq_put_right:Nn \l__egreg_output_string_seq { #1 }
   }
 }
% no argument: the postprocess macro is inserted bare after the mapping
\cs_new:Npn \egreg_print_list_commas:
 {
  \seq_use:Nnnn \l__egreg_output_string_seq { , } { , } { , }
 }

\NewDocumentCommand{\extractdigits}{ O{} m }
 {
  \egreg_string_process:nnnn
   { \egreg_store_digit:n }
   { \egreg_print_list_commas: }
   { #1 }
   { #2 }
 }

and try with

\begin{document}
\extractdigits{a1b2c3}
\end{document}

to get

1,2,3

Complete code for the toy problem:

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn
\seq_new:N \l__egreg_input_string_seq
\seq_new:N \l__egreg_output_string_seq

\cs_new_protected:Npn \egreg_string_process:nnnn #1 #2 #3 #4
% #1 = preprocess macro
% #2 = postprocess macro
% #3 = separator
% #4 = string
 {
  \seq_clear:N \l__egreg_output_string_seq
  \seq_set_split:Nnn \l__egreg_input_string_seq { #3 } { #4 }
  \seq_map_inline:Nn \l__egreg_input_string_seq
   { #1 { ##1 } }
  #2
 }

\cs_new_protected:Npn \egreg_store_digit:n #1
 {
  \bool_if:nT
   {
    \int_compare_p:n { `#1 >= `0 } && \int_compare_p:n { `#1 <= `9 }
   }
   {
    \seq_put_right:Nn \l__egreg_output_string_seq { #1 }
   }
 }
% no argument: the postprocess macro is inserted bare after the mapping
\cs_new:Npn \egreg_print_list_commas:
 {
  \seq_use:Nnnn \l__egreg_output_string_seq { , } { , } { , }
 }

\NewDocumentCommand{\extractdigits}{ O{} m }
 {
  \egreg_string_process:nnnn
   { \egreg_store_digit:n }
   { \egreg_print_list_commas: }
   { #1 }
   { #2 }
 }

\ExplSyntaxOff

\begin{document}
\extractdigits{a1b2c3d}
\end{document}
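
As an aside, and not part of the answer above: for the digit-extraction toy problem specifically, a shorter sketch is possible with the regular expression module of expl3. It assumes a reasonably recent LaTeX kernel in which \regex_extract_all:nnN is available without loading extra packages; the names \extractdigitsregex and \l__sketch_digits_seq are made up for this illustration.

```latex
\documentclass{article}

\ExplSyntaxOn
% hypothetical scratch sequence for the matches
\seq_new:N \l__sketch_digits_seq

% hypothetical command: collect every digit of #1, print them comma separated
\NewDocumentCommand{\extractdigitsregex}{ m }
 {
  % store each match of the regex \d (a single digit) in the sequence
  \regex_extract_all:nnN { \d } { #1 } \l__sketch_digits_seq
  \seq_use:Nn \l__sketch_digits_seq { , }
 }
\ExplSyntaxOff

\begin{document}
\extractdigitsregex{a1b2c3d}
\end{document}
```

This trades the generic preprocess/postprocess structure for brevity, so it is only a fit when the selection criterion can be written as a regular expression.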
0
13

The xstring package provides macros for splitting strings, extracting characters of strings and replacement in strings. This may be combined with the forloop and ifthen packages if needed.
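
A minimal sketch of that idea, assuming plain ASCII input: \StrLen stores the length of the string, and \StrChar picks out one character per iteration of an ifthen \whiledo loop. The helper \boxeachchar, the counter charpos, and the macros \charcount and \onechar are names invented for this example; they are not provided by xstring.

```latex
\documentclass{article}
\usepackage{xstring}
\usepackage{ifthen}

\newcounter{charpos}% hypothetical counter: position of the current character

% \boxeachchar{<string>} is a hypothetical helper, not part of xstring
\newcommand{\boxeachchar}[1]{%
  \StrLen{#1}[\charcount]% store the length of the string in \charcount
  \setcounter{charpos}{0}%
  \whiledo{\value{charpos}<\charcount}{%
    \stepcounter{charpos}%
    \StrChar{#1}{\value{charpos}}[\onechar]% store the character at this position
    \fbox{\strut\onechar} % leave a space after the box
  }%
}

\begin{document}
\boxeachchar{abcdef}
\end{document}
```

Replacing the \fbox line with any other one-argument macro applied to \onechar gives per-character processing in general.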

1
  • 1
    xstring seems to be a very well thought package. Thanks!
    – yannisl
    Commented Sep 3, 2010 at 19:55
12

Stop! I think we cannot finish without a LuaLaTeX example:). Here it is:

\documentclass{article}
\usepackage{luacode}

\begin{luacode}
    function GetDigits(str) str:gsub("%d",tex.print) end
\end{luacode}

\def\getDigits#1{\directlua{GetDigits("#1")}}

\begin{document}
\getDigits{a123b4c5}
\end{document}

Ok, this is a really minimal example. Instead of the simple tex.print you can do what you want with each of the digits (print, do calculations or store them for later use). If you want boxed digits try this:

str:gsub("%d", function(d) tex.print(string.format("\\fbox{\\strut%s}",d)) end)
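
Put together, the boxed variant might look like the following sketch. The names BoxDigits and \boxDigits are chosen here for illustration only, and the document has to be compiled with LuaLaTeX.

```latex
\documentclass{article}
\usepackage{luacode}

\begin{luacode}
    -- BoxDigits is a hypothetical name: box every digit found in str
    function BoxDigits(str)
        str:gsub("%d", function(d)
            -- print \fbox{\strut<digit>} back to TeX for each match
            tex.print(string.format("\\fbox{\\strut%s}", d))
        end)
    end
\end{luacode}

% hypothetical wrapper macro around the Lua function
\newcommand\boxDigits[1]{\directlua{BoxDigits("#1")}}

\begin{document}
\boxDigits{a123b4c5}
\end{document}
```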
11

It's fairly easy to write such a macro for yourself, at least for simple cases. Here's one I just wrote.

\newcommand*\foreachletter[2]{%
        \begingroup
        \let\templettercommand#1%
        \let\tempspacecommand#2%
        \catcode`\ 12
        \foreachlettergo
}
\def\foreachlettergo#1{%
        \testletter#1\relax
        \endgroup
}
\def\testletter#1#2\relax{%
        \if#1\otherspace
                \tempspacecommand
        \else
                \templettercommand{#1}%
        \fi
        \ifx\relax#2\relax
                \let\next\relax
        \else
                \let\next\testletter
        \fi
        \next#2\relax
}
\catcode`\ 12
\def\otherspace{ }%
\catcode`\ 10

\foreachletter\fbox\textvisiblespace{Here is some text!}

The \foreachletter macro takes three arguments. The first is a macro that takes a single argument (like \fbox); it is applied to each nonspace token in the third argument. The second is the macro that is executed for each space token. The third argument should contain only letters, "other" characters (for example, punctuation), and spaces.

This doesn't work in all cases, but I think it's more readable than using \futurelet which is probably the better way to do it.

1
  • 5
    Of course, one should go with an existing, well-tested solution. I just wanted to point out that it's not black magic.
    – TH.
    Commented Sep 4, 2010 at 7:26
3

Here is a version that uses the literate option from the listings package. After a call to \ExtractDigits{a123b4c5}, all the digits are extracted and available as a comma-separated list in the macro \ListOfDigits for further processing.


Notes:

  • The options to literate need to include all the characters that are to be ignored. So if you want to allow for any alphabetic character, then all 52 characters need to be listed (26 upper case and 26 lower case). If you want to allow for punctuation or other characters, those can be added as well.

  • I used \IfStrEq from the xstring package to ensure that a comma separator is only added after the first member of the list. If this additional package is not desired, it should be easy to rewrite this so that it is not needed.

  • The showframe package was used just to show the page margins.

  • It seems that if the last parameter, the <length> to the literate option is specified as zero, then the replacement text is not executed, but using -1 seemed to work. Not sure if this is a feature, as I do not see it in the documentation, but it does what was required without adding space in the output.

    I would think that there would have been some sort of wildcard option, but I have not been able to find that yet.

Code:

\documentclass{article}
\usepackage{listings}
\usepackage{pgffor}
\usepackage{xstring}
\usepackage{showframe}

% https://tex.stackexchange.com/questions/14393/how-keep-a-running-list-of-strings-and-then-process-them-one-at-a-time
\newcounter{NumberOfDigits}
\def\ListOfDigits{}
\makeatletter
\newcommand{\AddToListOfNumbers}[1]{%
    \IfStrEq{\ListOfDigits}{}{}{\g@addto@macro\ListOfDigits{,}}%
    \g@addto@macro\ListOfDigits{#1}%
    \stepcounter{NumberOfDigits}%
}
\makeatother
\lstdefinestyle{FormattedNumber}{%
    literate={0}{\AddToListOfNumbers{0}}{-1}%
             {1}{\AddToListOfNumbers{1}}{-1}%
             {2}{\AddToListOfNumbers{2}}{-1}%
             {3}{\AddToListOfNumbers{3}}{-1}%
             {4}{\AddToListOfNumbers{4}}{-1}%
             {5}{\AddToListOfNumbers{5}}{-1}%
             {6}{\AddToListOfNumbers{6}}{-1}%
             {7}{\AddToListOfNumbers{7}}{-1}%
             {8}{\AddToListOfNumbers{8}}{-1}%
             {9}{\AddToListOfNumbers{9}}{-1}% 
             {a}{}{0}%
             {b}{}{0}%
             {c}{}{0}%
            % .... code here missing ... list ALL characters that are to be ignored
             {y}{}{0}%
             {z}{}{0}%
             {A}{}{0}%
             {B}{}{0}%
             {C}{}{0}%
            % .... code here missing ... list ALL characters that are to be ignored
             {Y}{}{0}%
             {Z}{}{0}%
}
\newcommand{\ExtractDigits}[1]{%
    \setcounter{NumberOfDigits}{0}%
    \lstinline[style=FormattedNumber]{#1}%
}

\begin{document}
\noindent
\ExtractDigits{a123b4c5}%
The number of digits is \arabic{NumberOfDigits}.
The digits are:\par%
\foreach \Digit in \ListOfDigits {%
    ~\Digit\par%
}%
\end{document}
3

Just a quick hack using the coolstr package for \substr and the ifthen package for \whiledo:

  \newcounter{scancount}
   \whiledo{\value{scancount}<8}{%
      \stepcounter{scancount} 
      \thescancount 
      \substr{abcdefgh}{\thescancount}{1}
   }

This will print:

 1a 2b 3c 4d 5e 6f 7g 8h
2
  • Which package? CTAN has no coolstring package. Do you mean substr? Commented Jan 25, 2013 at 12:16
  • @MartinSchröder My bad, should read coolstr, will fix it thanks.
    – yannisl
    Commented Jan 25, 2013 at 12:28
3

This is based on some undocumented pgf routine. It doesn't require packages. You can redefine \dosomething to fit your needs.

\documentclass{article}
\makeatletter
\def\prg@token@stop{\prg@token@stop}% <- thanks to Joseph Wright
\newcommand\prg[1]{\expandafter\prg@i\@firstofone#1\prg@token@stop}
\def\prg@i#1{%
    \ifx\prg@token@stop#1% compare against the stop token defined above
    \else
     \dosomething{#1}
      \expandafter\prg@i
    \fi}  
\makeatother
\begin{document}
\newcommand\dosomething[1]{\fbox{#1}}
\prg{abcdef}
\end{document}


2
  • 1
    I'd be tempted to \def\prg@token@stop{\prg@token@stop} so it doesn't match other undefined tokens
    – Joseph Wright
    Commented Aug 23, 2019 at 8:55
  • @JosephWright Yes, that is a good suggestion.
    – user194703
    Commented Aug 23, 2019 at 9:00
1

The newly released tokcycle package can do this, all the while keeping track of macros, spaces, groups (explicit and implicit), active characters, etc.

\pars are not a problem and, in this MWE, spaces are expanded a bit for clarity.

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{tokcycle}
\makeatletter
\begin{document}
\catcode`?=\active
\def ?{package}
\tokencycle
  {\addcytoks{\fbox{#1}}}% CHARACTERS
  {\addcytoks{\textup{\{}}\processtoks{#1}\addcytoks{\textup{\}}}}% GROUPS
  {\addcytoks{#1}}% MACROS
  {\addcytoks{\hspace{2em minus 1em}}}% SPACES
Here is my example \textit{of a \bgroup \scshape Token Cycle\egroup{} using}
the

\textsf{tokcycle} ?.
\endtokencycle
\end{document}
