
I am using scrapy to fetch some resources, and I want to make it a cron job that runs every 30 minutes.

The cron job:

0,30 * * * * /home/us/jobs/run_scrapy.sh

run_scrapy.sh:

#!/bin/sh
cd ~/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
pkill -f $(pgrep run_scrapy.sh | grep -v $$)
sleep 2s
scrapy crawl good

As the script shows, I tried to kill any existing script process and its child process (scrapy) as well.

However, when I run two instances of the script, the newer instance does not kill the older one.

How to fix that?


Update:

I have more than one .sh scrapy script, each running at a different frequency configured in cron.


Update 2 - Test for Serg's answer:

All the cron jobs were stopped before I ran the test.

Then I open three terminal windows, named w1, w2 and w3, and run the commands in the following order:

Run `pgrep scrapy` in w3, which prints nothing (meaning no scrapy is running at the moment).

Run `./scrapy_wrapper.sh` in w1.

Run `pgrep scrapy` in w3, which prints one process ID, say 1234 (meaning scrapy has been started by the script).

Run `./scrapy_wrapper.sh` in w2, then check w1 and find that the script there has been terminated.

Run `pgrep scrapy` in w3, which prints two process IDs, 1234 and 5678.

Press Ctrl+C in w2 (twice).

Run `pgrep scrapy` in w3, which prints one process ID, 1234 (meaning the scrapy process 5678 has been stopped).

At this moment, I have to use `pkill scrapy` to stop the scrapy process with ID 1234.

hguser

9 Answers

9

A better approach would be to use a wrapper script that calls the main script. It would look like this:

#!/bin/bash
# This is /home/user/bin/wrapper.sh file
pkill -f 'main_script.sh'
exec bash ./main_script.sh

Of course, the wrapper has to be named differently, so that pkill can search for your main script alone. Your main script then reduces to this:

#!/bin/sh
cd /home/user/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
scrapy crawl good

Note that in my example I am using ./ because the script was in my current working directory. Use the full path to your script for best results.

I have tested this approach with a wrapper script and a simple main script that just runs an infinite while loop: launching a second instance of the wrapper killed the previous one.

Your script

This is just an example. Remember that I have no access to scrapy to actually test this, so adjust it as needed for your situation.

Your cron entry should look like this:

0,30 * * * * /home/us/jobs/scrapy_wrapper.sh

Contents of scrapy_wrapper.sh

#!/bin/bash
pkill -f 'run_scrapy.sh'
exec sh /home/us/jobs/run_scrapy.sh

Contents of run_scrapy.sh

#!/bin/bash
cd /home/user/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
# sleep delay now is not necessary
# but uncomment if you think it is
# sleep 2
scrapy crawl good
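
Since you mention several .sh scrapy scripts running on different cron schedules, the same pattern simply repeats once per script, with each wrapper pkill-ing only its own target. A hypothetical second pair (the names other_wrapper.sh and run_other.sh are placeholders) might look like:

15,45 * * * * /home/us/jobs/other_wrapper.sh

with other_wrapper.sh containing:

#!/bin/bash
# kill any previous instance of this particular script only
pkill -f 'run_other.sh'
exec sh /home/us/jobs/run_other.sh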
Arronical
5

If I understand what you are doing correctly, you want to call a process every 30 minutes (via cron), and when you start a new process via cron, you want to kill any existing instances still running?

You could use the "timeout" command to ensure that scrapy is forced to terminate if it is still running after 30 minutes.

This would make your script look like this:

#!/bin/sh
cd ~/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
timeout 30m scrapy crawl good

Note the timeout added in the last line.

I have set the duration to "30m" (30 minutes). You might want to choose a slightly shorter time (say 29m) to ensure that the process has terminated before the next job starts.

Note that if you change the spawn interval in crontab, you will have to edit the script as well.
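
If scrapy ignores the TERM signal that timeout sends by default, GNU timeout can follow up with KILL. A minimal sketch (the 29m/30s values are just one reasonable choice):

#!/bin/sh
cd ~/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
# send TERM after 29 minutes; if scrapy is still alive 30 seconds
# later, send KILL as well
timeout -k 30s 29m scrapy crawl good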

Nick Sillito
2

Great. A little update that allows the script to determine its own filename without hardcoding it:

#!/bin/bash
# runchecker.sh
# This script obtains its own filename and then checks
# whether the script is already running; if it is, it exits.

filename=$(basename "$0")
echo "running now $filename"

pids=($(pidof -x "$filename"))

if [ ${#pids[@]} -gt 1 ]; then
    echo "Script already running by pid ${pids[1]}"
    exit
fi

echo "Starting service"
sleep 1000

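Run it twice to see the guard in action; the output should look something like this (the PID will of course differ on your system):

$ ./runchecker.sh &
running now runchecker.sh
Starting service
$ ./runchecker.sh
running now runchecker.sh
Script already running by pid 12345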

RFV-370
1

As pkill terminates only the specified process, we should terminate its child processes using the -P option. The modified script will then look like this:

#!/bin/sh

cd /home/USERNAME/spiders/goods
PATH=$PATH:/usr/local/bin
export PATH
PID=$(pgrep -o run_scrapy.sh)
if [ $$ -ne $PID ] ; then pkill -P $PID ; sleep 2s ; fi
scrapy crawl good

pgrep -o finds the oldest instance of the process with the given name, so the script kills the children of the oldest running instance (when that instance is not itself) before starting its own crawl.

P.S. Your idea with grep -v $$ is good, but it won't return the PID of the other instance of run_scrapy.sh, because $$ will be the PID of the subprocess $(pgrep run_scrapy.sh | grep -v $$), not the PID of the run_scrapy.sh which started it. That's why I used another approach.
P.P.S. You'll find some other methods of terminating subprocesses in Bash here.

whtyger
1

Maybe you should monitor whether the script is running by creating a PID file for the parent shell script, and kill any previously running parent by checking that PID file. Something like this:

#!/bin/sh
PATH=$PATH:/usr/local/bin
PIDFILE=/var/run/scrappy.pid
TIMEOUT="10s"

#Check if script pid file exists and kill process
if [ -f "$PIDFILE" ]
then
  PID=$(cat $PIDFILE)
  #Check if process id is valid
  ps -p $PID >/dev/null 2>&1
  if [ "$?" -eq "0" ]
  then
    #If it is valid kill process id
    kill "$PID"
    #Wait for timeout
    sleep "$TIMEOUT"
    #Check if process is still running after timeout
    ps -p $PID >/dev/null 2>&1
    if [ "$?" -eq "0" ]
    then
      echo "ERROR: Process is still running"
      exit 1
    fi
  fi 
fi

#Create PID file
echo $$ > $PIDFILE
if [ "$?" -ne "0" ]
then
  echo "ERROR: Could not create PID file"
  exit 1
fi

export PATH
cd ~/spiders/goods
scrapy crawl good
#Delete PID file
rm "$PIDFILE"
iuuuuan
0

Too simple:

#!/bin/bash 

pids=($(pidof -x sample.sh))

if [ ${#pids[@]} -gt 1 ]; then
    echo "Script already running by pid ${pids[1]}"
    exit
fi

echo "Starting service"
sleep 1000
mah454
0

It can be very tricky to correctly identify exactly the process(es) belonging to another invocation of the command you're about to run based on a listing of all current processes.

Therefore, a well-established solution to this problem in the Unix world is to use a so-called sentinel file, usually containing nothing but the process id (PID) of the process creating the file (and called a pidfile for that reason).

Prior to invoking the command, you try to create the file with exclusive write access. If this fails, you bail out. If not, you run the command, and after completion, you remove the file.

Now if the command is killed with -KILL, or if the host loses power, you may end up with a lockfile for which the process has died. So at some point you should clean up lockfiles for which no corresponding process is running. This is why the process ID is written to the lockfile.
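
In shell, the "create with exclusive write access" step can be done atomically with the noclobber option. Here is a minimal sketch of the idea, with /tmp/run_scrapy.pid as an assumed path:

#!/bin/sh
PIDFILE=/tmp/run_scrapy.pid
# set -C (noclobber) makes the redirection fail if the file already
# exists, giving an atomic create-exclusively test in plain sh
if ! (set -C; echo $$ > "$PIDFILE") 2>/dev/null; then
    echo "already running as pid $(cat "$PIDFILE")" >&2
    exit 1
fi
scrapy crawl good
rm -f "$PIDFILE"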

Using file locking (which wasn't always available in early Unix, but it is in Linux today), you don't need to use the process ID: you can attempt to lock the file, creating it if it doesn't exist. If the process that created the file dies, the file will still exist, but the lock on it will have gone.

Linux now has a standard utility to do this for you: flock (see its manpage). You can wrap it around arbitrary commands.
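For the cron job in the question, that could look something like this (the lock file path /tmp/run_scrapy.lock is an assumption; -n tells flock to give up immediately instead of waiting):

0,30 * * * * /usr/bin/flock -n /tmp/run_scrapy.lock /home/us/jobs/run_scrapy.sh

The same can be done inside the script itself by taking the lock on a file descriptor:

#!/bin/sh
# open (and create, if needed) the lock file on file descriptor 9,
# then try to take an exclusive lock without blocking
exec 9>/tmp/run_scrapy.lock
if ! flock -n 9; then
    echo "another instance is already running" >&2
    exit 1
fi
# the lock is released automatically when the script exits
scrapy crawl good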

0

Well, I had a similar problem in C using popen(), and wanted to kill the parent and all of its children after a timeout. The trick is to set a process group ID when starting the parent, so that you don't kill yourself. How to do this can be read here: https://stackoverflow.com/questions/6549663/how-to-set-process-group-of-a-shell-script. With "ps -eo pid,ppid,cmd,etime" you can filter on runtime, so with both pieces of information you should be able to find all the old processes and kill them.
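
Translated to shell, the idea is something like the following sketch, assuming the job is started with setsid so that its PID equals its process group ID (paths as in the question):

# start the job in its own session (and therefore its own process group)
setsid /home/us/jobs/run_scrapy.sh &

# later: terminate the whole group of the oldest instance by
# negating its process group ID
PGID=$(pgrep -o -f run_scrapy.sh)
[ -n "$PGID" ] && kill -TERM -- "-$PGID"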

0x0C4
0

You could check an environment variable to track the status of the script and set it appropriately at script start, something like this pseudo code:

if "$SSS" = "Idle"
then 
    set $SSS=Running"
    your script
    set $SSS="Idle"

You can also track status by creating/checking/deleting a marker file, like touch /pathname/myscript.is.running, testing for its existence at launch, and rm /pathname/myscript.is.running at the end, as sketched below.
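
A minimal sketch of that marker-file idea (the path is the placeholder from above, and scrapy crawl good stands in for your script):

#!/bin/sh
MARKER=/pathname/myscript.is.running
# bail out if a previous run is still in progress
if [ -e "$MARKER" ]; then
    exit 1
fi
touch "$MARKER"
scrapy crawl good
rm "$MARKER"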

This approach will allow you to use different identifiers for your different scrapy scripts to avoid killing the wrong ones.

Regardless of how you track the status of your script and whether you deal with the problem by prevention of launch or killing the running process, I believe that using a wrapper script as suggested by @JacobVlijm & @Serg will make your life much easier.

Elder Geek