Write a spam-detection program

Question

You work for a social media platform, and are told to create a program in a language of your choice that will automatically flag certain post titles as "spam".

Your program must take the title as a string as input and output a truthy value if the title is spam, and a falsey value if not.

To qualify as non-spam, a title must conform to the following rules, otherwise it is spam:

A title can only contain spaces and the following characters: a-z, A-Z, 0-9, -, _, ., ,, ?, !
A title cannot have more than one capital letter per word
A title cannot have more than one exclamation mark or question mark in total
A title cannot have more than three full-stops (.)
A title cannot have more than one comma (,)
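
For reference, the five rules can be written out directly. The following is an ungolfed Python sketch, not part of the original question. The names `is_spam` and `ALLOWED` are made up for illustration, exclamation and question marks are counted together (as the `?!` test case implies), and words are split on single spaces as the FAQ specifies.

```python
import string

# Characters a non-spam title may contain (rule 1).
ALLOWED = set(string.ascii_letters + string.digits + "-_.,?! ")


def is_spam(title: str) -> bool:
    """Return True if the title breaks any of the five rules."""
    if any(ch not in ALLOWED for ch in title):
        return True  # rule 1: illegal character
    for word in title.split(" "):
        if sum(ch in string.ascii_uppercase for ch in word) > 1:
            return True  # rule 2: more than one capital letter in a word
    if title.count("!") + title.count("?") > 1:
        return True  # rule 3: more than one exclamation or question mark
    if title.count(".") > 3:
        return True  # rule 4: more than three full stops
    return title.count(",") > 1  # rule 5: more than one comma


print(is_spam("How on earth did this happen?!"))  # True
print(is_spam("How.on.earth.did this happen"))    # False
```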

Test cases:

Input: How On eaRth diD tHis happeN
Output: False

Input: How on earth did this happen 🔊
Output: True

Input: How ON earth did this happen
Output: True

Input: How on earth did this happen??
Output: True

Input: How on earth did this happen?!
Output: True

Input: How on earth did this happen!!
Output: True

Input: How! on! earth! did! this! happen!
Output: True

Input: How on earth did this happen! !
Output: True

Input: How on earth did this happen?
Output: False

Input: How on earth did this happen!
Output: False

Input: How on earth did this happen...
Output: False

Input: How.on.earth.did.this.happen
Output: True

Input: How.on.earth.did this happen
Output: False

Input: How, on, earth did this happen
Output: True

Input: How, on earth did this happen
Output: False

Input: How_on_earth_did_this_happen
Output: False

Input: How-on-earth-did-this-happen
Output: False

Input: How on earth did (this) happen
Output: True

Input: How on earth did "this" happen
Output: True

Input: How on earth did 'this' happen
Output: True

Input: How on earth did *this* happen
Output: True

Input: How on earth did [this] happen
Output: True

FAQ

Q: What is a valid title?

A: Single-character titles are valid. Your program has to validate a string that is not completely whitespace against the bullet-point rules above.

Q: What is a word?

A: Split the title string by a space. Each item in the resulting array is considered a word.
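
A short illustration of this definition in Python (a sketch; the sample title is invented). Splitting on a single space means punctuation does not separate words, so `O.K.` counts as one word with two capitals.

```python
# Words are whatever lies between single spaces; punctuation stays attached.
words = "Is this O.K. then".split(" ")
print(words)  # ['Is', 'this', 'O.K.', 'then']

# Count capital letters per word: 'O.K.' has two, so the title is spam.
capitals = [sum(c.isupper() for c in w) for w in words]
print(capitals)  # [1, 0, 2, 0]
```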

This is code-golf, so the shortest answer in bytes wins.

Suggest test case: 这到底怎么回事 Most regexp based answers which support Unicode would be confused if \w is used. — tsh, Commented Feb 10, 2022 at 7:43

nununoisy · Accepted Answer · 2022-02-08 16:13:36Z

6

JavaScript (Node.js), 72 71 68 64 bytes

-1 byte thanks to @ThisFieldIsRequired

-3 bytes inspired by @Neil's Retina answer

-2 bytes thanks to @emanresuA (see @Neil's answer)

-2 bytes by modifying final test slightly

m=>/[^\w-.,?! ]|[A-Z]\S*[A-Z]|[!?].*[!?]|,.*,|(\..*){4}/.test(m)

Try it online!

Regex abuse FTW.

Here's how it works:

[^\w-.,?! ] matches any character that isn't allowed.
[A-Z]\S*[A-Z] matches two uppercase letters without a space in between, i.e. two capitals in a single word
[!?].*[!?] matches two exclamation/question marks with anything in between
,.*, matches two commas with anything in between
(\..*){4} matches four periods with anything in between

If you put these together in a single regex as alternates, you get a spam filter that matches all criteria.
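
The same alternation can be tried outside JavaScript. Below is a rough Python port, offered as a sketch rather than as the answer's code. Two details are assumptions of this port: Python's `re` rejects `\w-.` inside a character class, so the dash is moved to the front, and `re.ASCII` is passed so that `\w` only matches `[a-zA-Z0-9_]`. Without that flag, a title such as 这到底怎么回事 would not be flagged.

```python
import re

# Port of the five alternatives; dash moved first, \w restricted to ASCII.
SPAM = re.compile(
    r"[^-.,?!\w ]"      # a character outside the allowed set
    r"|[A-Z]\S*[A-Z]"   # two capitals within one word
    r"|[!?].*[!?]"      # two exclamation/question marks
    r"|,.*,"            # two commas
    r"|(\..*){4}",      # four full stops
    re.ASCII,
)


def is_spam(title: str) -> bool:
    return SPAM.search(title) is not None


print(is_spam("How ON earth did this happen"))     # True
print(is_spam("How on earth did this happen..."))  # False
print(is_spam("这到底怎么回事"))                     # True
```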

edited Feb 8, 2022 at 16:13

answered Feb 7, 2022 at 16:35


-1 byte by using test.
– ophact
Commented Feb 7, 2022 at 16:38
I thought I had done it switching from search to match but you got even further down. Thanks for that!
– nununoisy
Commented Feb 7, 2022 at 16:41
it doesn't save any space, but for fun you can change -., to ,-., because ASCII
– Dave
Commented Feb 8, 2022 at 23:35
@Dave I tried it and it broke the code???? Try It Online!
– Aiden Chow
Commented Feb 9, 2022 at 4:00
@AidenChow it needs to be ,-. not .-,. The order matters! Try to figure out why 😉
– Dave
Commented Feb 9, 2022 at 18:03
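
One way to see what Dave is pointing at, sketched in Python (the assumption here is that its range rule agrees with ECMAScript's on this point): in ASCII the characters `,`, `-` and `.` have the consecutive codes 44, 45 and 46. So `,-.` is a valid ascending range covering exactly those three, whereas `.-,` runs from 46 down to 44 and is rejected.

```python
import re

print([ord(c) for c in ",-."])            # [44, 45, 46]
print(re.findall(r"[,-.]", "a,b-c.d_e"))  # [',', '-', '.']

try:
    re.compile(r"[.-,]")  # 46 down to 44: a reversed range
except re.error as exc:
    print("rejected:", exc)
```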


veqtrus · Accepted Answer · 2022-02-12 18:51:09Z

C (gcc), 251 219 215 199 197 193 189 184 181 bytes

-32 bytes thanks to @ceilingcat

-4 bytes by subtracting 43 from c

-16 bytes by moving comparisons inside ternaries

-2 bytes by removing unneeded brackets

-4 bytes by realising that ++x>1 is equivalent to x++

-4 bytes by rearranging outer ternary and adjusting subtracted amount

-5 bytes by moving checks inside loop condition

-3 bytes by using the n array for storing the output.

#define C(l,h)c>l&c<h?n[l]++
f(char*s){int c,n[64]={1};while((c=*s++)&&!(c-=41,c+9?*n=c+8&&c-22?C(55,82)<0:C(23,50):C(6,17)<0:C(4,6)>2:C(2,4):c-4&&c-54:n[1]++:(n[23]=0)));return*n;}

Try it online!

The C macro increments a counter for a character class. I used decimal literals instead of char literals. When a space is encountered the uppercase counter is reset.
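
The counting idea can be shown without the golfing. This Python sketch is not a translation of the C code; it only mirrors the approach described above (one pass over the string, a counter per character class, the uppercase counter reset at each space). All variable names are invented.

```python
def is_spam(title: str) -> bool:
    marks = commas = stops = capitals = 0
    for ch in title:
        if ch == " ":
            capitals = 0  # a new word starts: reset the capital count
        elif "A" <= ch <= "Z":
            capitals += 1
            if capitals > 1:
                return True
        elif ch in "!?":
            marks += 1
            if marks > 1:
                return True
        elif ch == ",":
            commas += 1
            if commas > 1:
                return True
        elif ch == ".":
            stops += 1
            if stops > 3:
                return True
        elif not ("a" <= ch <= "z" or "0" <= ch <= "9" or ch in "-_"):
            return True  # anything else is an illegal character
    return False
```

Returning as soon as a counter exceeds its limit plays the same role as the loop condition in the C version, which stops scanning once a check fails.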

Welcome to Code Golf, and nice answer!
– Rydwolf Programs
Commented Feb 8, 2022 at 4:43

Ginger · Accepted Answer · 2022-02-07 19:03:30Z

5

Python, 177 171 bytes

This requires the re module, so that adds an additional 9 bytes.

lambda i:any([re.search('[^a-zA-Z0-9\-.\,\?\! ]',i),*[len(re.findall("[A-Z]",w))>1for w in i.split(" ")],len(re.findall('[?!]',i))>1,i.count(".")>3,i.count(",")>1])

Attempt This Online!

edited Feb 7, 2022 at 19:03

answered Feb 7, 2022 at 14:29


Use r'[^\w -.,?!]' to save some bytes (note that the underscore is allowed). Also why do you count 177 bytes when the link says 170?
– Parcly Taxel
Commented Feb 7, 2022 at 14:49
@ParclyTaxel because of import re
– Larry Bagel
Commented Feb 7, 2022 at 14:55
@BgilMidol but in that case wouldn't it be 179 bytes?
– Parcly Taxel
Commented Feb 7, 2022 at 15:11
@ParclyTaxel I considered that, but the punctuation marks are counted separately. With that regex, I could have, say, one comma, one question mark, and one exclamation mark, and it would add up to be too much despite those all being under their respective limits.
– Ginger
Commented Feb 7, 2022 at 15:52
I also fixed the link.
– Ginger
Commented Feb 7, 2022 at 15:52


Neil · Accepted Answer · 2022-02-08 17:22:34Z

5

Retina 0.8.2, 55 53 51 bytes

[^-.,?!\w ]|[?!].*[?!]|,.*,|(\..*){4}|[A-Z]\S*[A-Z]

Try it online! Link includes test cases. Edit: Saved 2 bytes thanks to @Ausername and another 2 bytes thanks to @nununoisy. Simply reports on the number of banned patterns it finds, so for some spam the truthy value might be greater than 1; if this is undesirable, 1` can be prefixed to the program which limits the count to 1. Explanation:

[^-.,?!\w ]     Check for illegal characters.
[?!].*[?!]      Check for two question or exclamation marks.
,.*,            Check for two commas.
(\..*){4}       Check for four or more full stops.
[A-Z]\S*[A-Z]   Check for two uppercase letters in the same word.
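
To illustrate why the truthy value can exceed 1, here is a Python approximation of match counting. It assumes that counting non-overlapping matches with `re.finditer` behaves like Retina's count for these inputs; that is an assumption and has not been checked against Retina itself. `banned_count` is a hypothetical name.

```python
import re

PATTERN = re.compile(
    r"[^-.,?!\w ]|[?!].*[?!]|,.*,|(\..*){4}|[A-Z]\S*[A-Z]", re.ASCII
)


def banned_count(title: str) -> int:
    # Number of non-overlapping matches, scanning left to right.
    return sum(1 for _ in PATTERN.finditer(title))


print(banned_count("How on earth did this happen?"))       # 0
print(banned_count("How on earth did (this) happen"))      # 2
print(banned_count("How! on! earth! did! this! happen!"))  # 1
```

In the last title the greedy `.*` stretches one match from the first exclamation mark to the last, so six marks still count as a single banned pattern.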

edited Feb 8, 2022 at 17:22

answered Feb 7, 2022 at 16:16


Can [^ ] be \w?
– emanresu A
Commented Feb 8, 2022 at 10:09
@emanresuA The question says words are only split by spaces, so for instance "O.K." would be an illegal word.
– Neil
Commented Feb 8, 2022 at 10:10
Ok. What about \S?
– emanresu A
Commented Feb 8, 2022 at 10:10
@emanresuA Yes, that works, thanks!
– Neil
Commented Feb 8, 2022 at 10:14
-2 bytes by changing \.(.*\.){3} to (\..*){4}.
– nununoisy
Commented Feb 8, 2022 at 16:14


Kevin Cruijssen · Accepted Answer · 2022-02-07 15:54:53Z

05AB1E, 38 37 36 bytes

žj…,!?©… -.JÃÊ·IS#.uOI®S¢ā£OI'.¢;M2@

Try it online or verify all test cases.

Explanation:

žj              # "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
  …,!?          # Push string ",!?"
      ©         # Store it in variable `®` (without popping)
       … -.     # Push string " -."
           J    # Join all three strings on the stack together
            Ã   # Only keep those characters from the (implicit) input
             Ê  # Check if it's now NOT equal to the (implicit) input
              · # Double it (2 if truthy; 0 if falsey)
I               # Push the input
 S              # Convert it to a list of characters
  #             # Split it on spaces
   .u           # Check for each character if it's an uppercased letter
     O          # Sum those checks for each word
 ®S             # Push [",","!","?"] (variable `®` as list of characters)
I  ¢            # Count these characters in the input
    ā           # Push a list in the range [1,length] (without popping): [1,2,3]
     £          # Split the counts into those parts: [[a],[b,c],[]]
                # (a=count of ","; b=count of "!"; c=count of "?")     
      O         # Sum each inner list: [a,b+c,0]
I               # Push the input yet again
 '.¢            # Count the number of "." in the input
    ;           # Halve it
M               # Push the largest number of the stack (including within lists)
 2@             # Check if this max is ≥2
                # (after which it is output implicitly as result)
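
The trick in this answer is to scale every check so that "spam" means "at least 2", then take a single maximum. A Python sketch of that idea follows. It mirrors the explanation above rather than the exact 05AB1E semantics, and all names are hypothetical.

```python
import string

ALLOWED = string.ascii_letters + string.digits + "_,!? -."


def is_spam(title: str) -> bool:
    values = [
        2 * any(ch not in ALLOWED for ch in title),  # illegal char -> 2
        title.count(","),                            # commas
        title.count("!") + title.count("?"),         # ! and ? combined
        title.count(".") / 2,                        # 4 full stops -> 2.0
    ]
    # Capital letters per space-separated word.
    values += [
        sum(ch in string.ascii_uppercase for ch in word)
        for word in title.split(" ")
    ]
    return max(values) >= 2
```

Halving the full-stop count is what lets a limit of three share the same threshold as the limits of one: three stops give 1.5, four give 2.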

Neil · Accepted Answer · 2022-02-07 16:28:07Z

Charcoal, 52 bytes

⌈⟦⊙θ¬№⁺ !,-.?_⭆⁶²⍘λφι‹¹⁺№θ!№θ?‹¹№θ,‹³№θ.⊙⪪θ ‹¹ＬΦι№αλ

Try it online! Link is to verbose version of code. Outputs a Charcoal boolean, i.e. - for spam, nothing if not. Explanation:

⌈⟦

Output if any of the following is true.

⊙θ¬№⁺ !,-.?_⭆⁶²⍘λφι

Check whether any characters aren't contained in the permitted list including the 62 alphanumeric characters used by default for base conversion.

‹¹⁺№θ!№θ?

Check whether there is more than one exclamation or question mark.

‹¹№θ,

Check whether there is more than one comma.

‹³№θ.

Check whether there are more than three full stops.

⊙⪪θ ‹¹ＬΦι№αλ

Check whether any word has more than one upper case letter.

DeathIncarnate · Accepted Answer · 2022-02-09 00:37:05Z

Burlesque (no RegEx), 103 88 81 76 74 bytes

Jwd{qsnfl2.<}aljJbc".,?!"qCNZ]^p.+1<=j1<=&&j3<=&&jNBqrifn" -_.,?!"\\z?&&&&

Try it online!

A shorter answer is almost certainly possible; this is pretty brute force.

Jwd{qsnfl2.<}alj  # Check double caps

J       # Duplicate input
wd      # Split into words
{      
 qsn    # isUpper
 fl     # filter length
 2.<    # <2
}       #
al      # All
j       # Swap input to top of stack

Jbc".,?!"qCNZ]^p.+1<=j1<=&&j3<=&&j  # Check char counts

J       # Duplicate input
bc      # Box and repeat infinitely
".,?!"  # String
qCN     # Count occurrences
Z]      # Zip with count (return list of counts for each)
^p      # Push list to stack
.+      # Add ?s and !s
1<=j    # ?+! <= 1
1<=&&j  # , <= 1
3<=&&   # . <= 3

NBqrifn" -_.,?!"\\z?  # Check valid chars

NB          # Remove duplicates
qri         # Quoted is alphanum
fn          # Filter not
" -_.,?!"   # Valid non-letter
\\          # List diff
z?          # Empty

&&&&        # Reduce all 3 by ands

ophact · Accepted Answer · 2022-02-07 16:33:59Z

2

JavaScript (Node.js), 132 bytes

n=>n.split` `.some(w=>/[^\w-.,?!]/.test(w)+(F=C=>w.split(C).length-1,c+=F`.`,d+=F(/[?!]/),e+=F`,`,F(/[A-Z]/)>1),c=d=e=0)|c>3|d>1|e>1

Try it online!

If you want to be completely sure that the answer works, add a backslash before the dash in the first regular expression. The code above passes all test cases, but a comment on the Python answer seems to indicate that there should be a backslash or space before the dash. If anyone could confirm or disprove this, it would be very helpful.

edited Feb 7, 2022 at 16:33

answered Feb 7, 2022 at 16:28


The dash is usually used for a range, however in an ECMAScript regex you can't make a character class part of a range, so it gets treated as a dash only if it follows the \w or is at the end of the group.
– nununoisy
Commented Feb 7, 2022 at 16:36


Seggan · Accepted Answer · 2022-02-22 22:49:20Z

Vyxal, 62 58 57 56 bytes

`[^\w.,?! -]`ẎL‛?!øB?ẎL1>?⌈ƛ`[A-Z]`nẎL1>;a?\.O3>?\,O1>Wa

Try it Online!

My first Vyxal answer, and I'm loving this language. So much more intuitive than Jelly. 99% sure this can be golfed more.

Explanation:

`[^\w.,?! -]`ẎL‛?!øB?ẎL1>?⌈ƛ`[A-Z]`nẎL1>;a?\.O3>?\,O1>Wa ; Takes the title as input
`[^\w.,?! -]`ẎL                                          ; Number of illegal characters matched (0 if none)
               ‛?!                                       ; The string '?!'
                  øB                                     ; Bracketify: converts '?!' to '[?!]'
                    ?ẎL                                  ; Find all '?' and '!' and count them
                       1>                                ; More than 1?
                         ?⌈                              ; Split the input on spaces
                           ƛ            ;                ; Mapping lambda: maps all the words using the following criteria
                            `[A-Z]`nẎL                   ; How many capital letters in the word?
                                      1>                 ; More than 1?
                                         a               ; Any truthy? (i.e. any words with more than 1 capital letter?)
                                          ?\.O           ; Count full stops in string
                                              3>         ; More than 3?
                                                ?\,O     ; Count commas in string
                                                    1>   ; More than 1?
                                                      W  ; Turn the stack into a list
                                                       a ; Any truthy? (i.e. are any of the conditions true?)

Vyxal, 56 bytes

`[^\w-.,?! ]|[A-Z]\S*[A-Z]|[!?].*[!?]|,.*,|(\..*){4}`?ẎL

Try it Online!

A different version, based on the Node.js regex answer.
