QuotaClimat
The aim of this work is to deliver a tool to a consortium built around QuotaClimat and Climat Medias, allowing them to quantify media coverage of the climate crisis.
https://github.com/dataforgoodfr/quotaclimat
Category: Sustainable Development
Sub Category: Knowledge Platforms
Keywords from Contributors
transforms archiving measur profiles compose generic trace tokenizer projection blog
Repository metadata
Observatoire des Médias sur l'Ecologie
- Host: GitHub
- URL: https://github.com/dataforgoodfr/quotaclimat
- Owner: dataforgoodfr
- License: mit
- Created: 2022-10-05T10:57:06.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-04-20T17:04:02.000Z (8 days ago)
- Last Synced: 2025-04-22T00:38:15.701Z (6 days ago)
- Language: Python
- Homepage: https://observatoiremediaecologie.fr/
- Size: 8.09 GB
- Stars: 32
- Watchers: 5
- Forks: 7
- Open Issues: 12
- Releases: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README.md
The aim of this work is to deliver a tool to a consortium built around QuotaClimat and Climat Medias, allowing them to quantify media coverage of the climate crisis.
Radio and TV data are collected through the Mediatree API.
Web press collection is currently a work in progress (as of 04/2024).
- 2022-09-28, Introduction by Eva Morel (Quota Climat): from 14:10 to 32:00 https://www.youtube.com/watch?v=GMrwDjq3rYs
- 2022-11-29 Project status and prospects by Estelle Rambier (Data): from 09:00 to 25:00 https://www.youtube.com/watch?v=cLGQxHJWwYA
- 2024-03 Project tech presentation by Paul Leclercq (Data) : https://www.youtube.com/watch?v=zWk4WLVC5Hs
Index
🤱 I want to contribute! Where do I start?
- Learn about the project by watching the introduction videos mentioned above.
- Create an issue and/or join https://dataforgood.fr/join and the Slack channel #offseason_quotaclimat.
- Introduce yourself on Slack #offseason_quotaclimat
🔧 Development
Contributing
🔩 Setting up the environment
Following the steps below will align your local environment with that of every other collaborator.
First install pyenv:
cd -
# macOS (Homebrew)
brew install pyenv # pyenv itself
brew install pyenv-virtualenv # integration with Python virtualenvs
# Linux (Debian/Ubuntu)
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
curl https://pyenv.run | bash
Make the shell pyenv aware:
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
On Windows, go to System Properties > Advanced > Environment Variables...
Choose the variable "Path" > Edit... and add the path to your Python installation, where python.exe is located (by default, this should be C:\Users\username\AppData\Roaming\Python\Scripts\ )
In the console, you can now try :
poetry --version
Let's install a Python version (on Windows, this step has been done with miniconda):
pyenv install 3.11.6 # this will take time
Check that it works properly. This command:
pyenv versions
should return:
system
3.11.6
You are now ready to create a virtual environment. Go to the project folder and run:
pyenv virtualenv 3.11.6 quotaclimat
pyenv local quotaclimat
In case of a version upgrade, you can run these commands to switch:
eval "$(pyenv init --path)"
pyenv activate 3.11.6/envs/quotaclimat
You now need a tool to manage dependencies. Let's use poetry.
On Windows, you will need a VS installation if you do not already have one.
pip install poetry
poetry update
poetry lock
Author's note: I have not been able to get wordcloud working on Windows.
When you need to install a new dependency (to use a new package, e.g. nltk), run
poetry add nltk
Update Poetry itself
poetry self update
After committing to the repo, other team members will be able to use the exact same environment you are using.
Docker
First, make sure Docker and Compose are installed on your computer.
Then start the different services:
## To run only one service, have a look at docker-compose.yml and pick one service:
docker compose up metabase
docker compose up ingest_to_db
docker compose up mediatree
docker compose up test
If you add a new dependency, don't forget to rebuild
docker compose build test # or ingest_to_db, mediatree etc
Explore postgres data using Metabase - a BI tool
docker compose up metabase -d
This gives you access to Metabase to explore the SQL tables sitemap or keywords at http://localhost:3000/
To connect to it, use the variables defined in docker-compose.yml:
- password: password
- username: user
- db: barometre
- host : postgres_db
Production metabase
If we encounter an OOM (out-of-memory) error, we can set this env variable: JAVA_OPTS=-Xmx2g
Web Press - How to scrape
The scraping of sitemap.xml is done using the library advertools.
A great way to discover a sitemap.xml is to check the robots.txt page of a website: https://www.midilibre.fr/robots.txt
Which media to parse? This document is a good start.
Learn more about site maps here.
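The robots.txt tip above can be sketched with the Python standard library alone. This is only an illustration of sitemap discovery, not the project's advertools-based scraping; the function name and the example.com content are hypothetical.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots_txt(robots_txt: str) -> List[str]:
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no "Sitemap:" line is present.
    return parser.site_maps() or []


# Hypothetical robots.txt content, for illustration only.
sample = """User-agent: *
Disallow: /private
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""
print(sitemaps_from_robots_txt(sample))
```

To run it against a real site, download the robots.txt text first and pass it to the function.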
Scrape every sitemap
By default, we use an env variable ENV to only parse from localhost. If you set it to anything other than docker or dev, it will parse everything.
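A minimal sketch of the ENV switch described above. The function name is hypothetical, and treating an unset ENV like "docker" is an assumption; the actual default lives in the project code.

```python
import os
from typing import Mapping, Optional

# The two values the text above names as "local only".
LOCAL_ENVS = {"docker", "dev"}


def parses_everything(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when ENV asks for a full parse rather than localhost only."""
    environ = os.environ if environ is None else environ
    # Assumption: an unset ENV behaves like "docker" (local parsing only).
    return environ.get("ENV", "docker") not in LOCAL_ENVS
```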
Test
Thanks to the nginx container, we can have a local server for sitemap :
docker compose up -d nginx # used to scrape sitemaps locally - a Le Figaro-like website with only 3 news items
# docker compose up test with entrypoint modified to sleep
# docker exec test bash
pytest -vv --log-level DEBUG test # "test" is the folder containing tests
# Only one test
pytest -vv --log-level DEBUG -k detect
# OR
docker compose up test # test is the container name running pytest test
Deploy
Every commit on the main branch builds a new image and pushes it to the Scaleway container registry, from which it is deployed. Have a look at .github/deploy-main.yml.
Learn more here.
Monitoring
Monitoring is done with Sentry, configured through the env variable SENTRY_DSN.
Learn more here : https://docs.sentry.io/platforms/python/configuration/options/
Mediatree - Import data
Mediatree Documentation API : https://keywords.mediatree.fr/docs/
You must contact the QuotaClimat team to obtain 2 files containing the API's username and password:
- secrets/pwd_api.txt
- secrets/username_api.txt
Otherwise, a mock api response is available at https://github.com/dataforgoodfr/quotaclimat/blob/main/test/sitemap/mediatree.json
You can check the API with
curl -X POST https://keywords.mediatree.fr/api/auth/token/ \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=password" \
-d "username=USERNAME" \
-d "password=PASSWORD"
curl -X GET "https://keywords.mediatree.fr/api/epg/?channel=tf1&start_gte=2024-09-01T00:00:00&start_lte=2024-09-01T23:59:59&token=TOKEN_RECEIVED_FROM_PREVIOUS_QUERY"
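The two curl calls above can be mirrored in Python with the standard library. This sketch only builds the requests, using the endpoints and parameters shown in the curl commands; the function names are hypothetical, and the response format is not described here, so it is not parsed.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://keywords.mediatree.fr/api"


def build_token_request(username: str, password: str) -> Request:
    """Build (without sending) the POST request that asks for an auth token."""
    body = urlencode(
        {"grant_type": "password", "username": username, "password": password}
    ).encode("utf-8")
    return Request(
        f"{BASE_URL}/auth/token/",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def build_epg_url(channel: str, start_gte: str, start_lte: str, token: str) -> str:
    """Build the GET URL for the EPG endpoint, percent-encoding the query values."""
    query = urlencode(
        {
            "channel": channel,
            "start_gte": start_gte,
            "start_lte": start_lte,
            "token": token,
        }
    )
    return f"{BASE_URL}/epg/?{query}"
```

To actually send a request, pass the result to `urllib.request.urlopen`, with the credentials read from the secrets files listed above.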
Run
docker compose up mediatree
Configuration - Batch import
Based on time
If our media perimeter evolves, we have to reimport everything using the env variable START_DATE, as in docker compose (epoch seconds format: 1705409797). By default, it imports 1 day; you can change this with NUMBER_OF_PREVIOUS_DAYS (integer).
If START_DATE is not set, the default is yesterday at midnight (the default cron job).
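To produce or read a START_DATE value in epoch seconds, a small conversion helper can be used. This is a sketch that assumes UTC; the timezone actually used by the import job is not stated here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_seconds(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole epoch seconds."""
    return int(moment.timestamp())


def from_epoch_seconds(epoch: int) -> datetime:
    """Convert epoch seconds back to a UTC datetime (UTC is an assumption)."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


def yesterday_midnight(now: datetime) -> datetime:
    """Midnight of the day before `now`, mirroring the default described above."""
    return (now - timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)


# The example value from the text corresponds to 2024-01-16 12:56:37 UTC.
print(from_epoch_seconds(1705409797).isoformat())
```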
Production safety nets
As the Scaleway Serverless service can be down, if some dates are missing up to today, the import restarts from the latest saved date and runs until today.
As pandas to_sql does not support upsert (update/insert), if we want to update rows that are already saved, we first have to delete them and then start the program with START_DATE:
DELETE FROM keywords
WHERE start BETWEEN '2024-05-01' AND '2024-05-30';
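The delete-then-reimport pattern above can be sketched as a single transaction. This illustration uses an in-memory SQLite database as a stand-in for PostgreSQL and a simplified, hypothetical three-column keywords table; it is not the project's import code.

```python
import sqlite3


def replace_rows_in_range(connection, start, end, rows):
    """Delete the rows of a date range, then insert their fresh versions."""
    with connection:  # commits both statements together, rolls back on error
        connection.execute(
            'DELETE FROM keywords WHERE "start" BETWEEN ? AND ?', (start, end)
        )
        connection.executemany(
            'INSERT INTO keywords (id, "start", number_of_keywords) VALUES (?, ?, ?)',
            rows,
        )


connection = sqlite3.connect(":memory:")
connection.execute(
    'CREATE TABLE keywords (id TEXT PRIMARY KEY, "start" TEXT, number_of_keywords INTEGER)'
)
connection.executemany(
    "INSERT INTO keywords VALUES (?, ?, ?)",
    [("a", "2024-05-10", 1), ("b", "2024-06-02", 2)],
)

# Re-import May 2024 only: row "a" is replaced, row "b" is left untouched.
replace_rows_in_range(connection, "2024-05-01", "2024-05-30", [("a", "2024-05-10", 5)])
result = connection.execute(
    "SELECT id, number_of_keywords FROM keywords ORDER BY id"
).fetchall()
```

Without the delete step, inserting row "a" again would fail on the primary key, which is why a plain append cannot update existing rows.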
Based on channel
Use the env variable CHANNEL, as in docker compose (string: tf1). Otherwise, the default is all channels.
Update without querying Mediatree API
When we have a new word detection logic and data from Mediatree is already saved in our DB (otherwise see Batch import based on time or channel), we can re-apply the logic to all keywords saved in our database.
⚠️ In this case, we do not requery the Mediatree API, so we can miss some chunks, but it is faster. Choose wisely between importing and updating.
We should use the env variable UPDATE, as in docker compose (it should be set to "true").
To see an actual change in the local DB, first run the tests with docker compose up test and then run these commands:
docker exec -ti quotaclimat-postgres_db-1 bash # or docker compose exec postgres_db bash
psql -h localhost --port 5432 -d barometre -U user
--> enter password : password
UPDATE keywords set number_of_keywords=1000 WHERE id = '71b8126a50c1ed2e5cb1eab00e4481c33587db478472c2c0e74325abb872bef6';
UPDATE keywords set number_of_keywords=1000 WHERE id = '975b41e76d298711cf55113a282e7f11c28157d761233838bb700253d47be262';
After setting the UPDATE env variable to true inside docker-compose.yml and running docker compose up mediatree, you should see these logs:
update_pg_keywords.py:20 | Difference old 1000 - new_number_of_keywords 0
We can adjust batch update with these env variables (as in the docker-compose.yml):
BATCH_SIZE: 50000 # number of records to update in one batch
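As an illustration of what BATCH_SIZE controls, here is a generic sketch that splits records into fixed-size batches. The helper name is hypothetical and this is not the project's update code.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
BATCH_SIZE = 50000  # same value as in the docker-compose.yml excerpt above


def iter_batches(records: Iterable[T], batch_size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A smaller batch size means more round trips to the database but less memory held at once.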
Update only one channel
Use the env variable CHANNEL, as in docker compose (string: tf1), with UPDATE set to true.
Batch program data
UPDATE_PROGRAM_ONLY set to true will only update program metadata; otherwise, program metadata and all theme/keywords calculations are updated.
UPDATE_PROGRAM_CHANNEL_EMPTY_ONLY set to true will only update program metadata that has an empty value: "".
Batch update from a date
With more than 1 million rows, we can update from an offset to fix custom logic by using START_DATE_UPDATE (YYYY-MM-DD, default: first day of the current month). By default the update runs until the end of the month; otherwise you can specify END_DATE (optional, YYYY-MM-DD) to batch update PostgreSQL over a date range.
Env variables list :
- START_DATE_UPDATE: string (YYYY-MM-DD) - defaults to today minus NUMBER_OF_DAYS (the date is included in the query)
- END_DATE: string (YYYY-MM-DD) - defaults to the end of the month (the date is included in the query)
- NUMBER_OF_DAYS: integer, defaults to 7 days - number of days to update, from (START_DATE_UPDATE - NUMBER_OF_DAYS) until START_DATE_UPDATE, if START_DATE_UPDATE is empty
- STOP_WORD_KEYWORD_ONLY: boolean, defaults to False. If true, only updates rows whose plaintext matches a top stop word's keyword. It is used to speed up updates.
- BIODIVERSITY_ONLY: boolean (default=false), if true will only update rows that have at least one number_of_biodiversity_* > 0
Example inside the docker-compose.yml mediatree service -> START_DATE_UPDATE: 2024-04-01 - default END_DATE will be 2024-04-30
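The END_DATE default from the example above (2024-04-01 gives 2024-04-30) can be sketched as follows. The function name is hypothetical, and only the END_DATE default is covered, not the START_DATE_UPDATE default.

```python
import calendar
from datetime import date
from typing import Optional, Tuple


def resolve_update_range(start_date_update: str, end_date: Optional[str] = None) -> Tuple[date, date]:
    """Return the (start, end) dates of the update, both included."""
    start = date.fromisoformat(start_date_update)
    if end_date:
        end = date.fromisoformat(end_date)
    else:
        # monthrange returns (weekday of the 1st, number of days in the month).
        last_day = calendar.monthrange(start.year, start.month)[1]
        end = start.replace(day=last_day)
    return start, end


print(resolve_update_range("2024-04-01"))
```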
We can use a GitHub Actions workflow to start multiple update operations with different dates, set through the matrix.
Production executions
~55 minutes to update 50K rows on a mVCPU 2240 - 4Gb RAM on Scaleway.
Every month has ~80K rows.
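These two figures give a rough way to estimate how long an update takes. The sketch below assumes the duration scales linearly with the number of rows, which may not hold in practice; the names are hypothetical.

```python
MINUTES_PER_REFERENCE_RUN = 55   # measured duration quoted above
REFERENCE_RUN_ROWS = 50_000      # rows updated in that duration
ROWS_PER_MONTH = 80_000          # approximate monthly volume quoted above


def estimate_update_minutes(rows: int) -> float:
    """Estimate the update duration in minutes, assuming linear scaling."""
    return rows * MINUTES_PER_REFERENCE_RUN / REFERENCE_RUN_ROWS


one_month = estimate_update_minutes(ROWS_PER_MONTH)       # about 88 minutes
one_year = estimate_update_minutes(12 * ROWS_PER_MONTH)   # about 1056 minutes
```

Under that assumption, one month of data takes roughly an hour and a half, and a full year about 17.6 hours, which is one reason to split updates by date range.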
SQL Tables evolution
Using Alembic's auto-generated migrations, we can add a new column inside models.py and Alembic will generate the schema evolution automatically:
# If changes have already been applied (on your feature branch), you have to recreate your alembic file by doing:
# 1. change to your main branch
git switch main
# 2. start test container (docker compose up testconsole -d / docker compose exec testconsole bash) and run "pytest -vv -k api" to rebuild the state of the DB (or drop the table you want) - just let it run a few seconds.
# 3. switch back to your WIP branch
git switch -
# 4. connect to the test container : docker compose up testconsole -d / docker compose exec testconsole bash
# 5. reapply the latest saved state :
poetry run alembic stamp head
# 6. Save the new columns
poetry run alembic revision --autogenerate -m "Add new column test for table keywords"
# this should generate a file to commit inside "alembic/versions"
# 7. to apply it we need to run, from our container
poetry run alembic upgrade head
Inside our Dockerfile_api_import, we call this line
# to migrate SQL tables schema if needed
RUN alembic upgrade head
Channel metadata
To keep the channel perimeter (weekday, hours) up to date, we save the current version inside postgres/channel_metadata.json. If we modify this file, the next deploy will update every row of the PostgreSQL table channel_metadata.
Keywords
Produce keywords list from Excel files
How to update quotaclimat/data_processing/mediatree/keyword/keyword.py from shared Excel files?
Download the files locally to "document-experts" from Google Drive (ask on Slack), then:
# Be sure to have updated the folder "document-experts" before running it :
poetry run python3 transform_excel_to_json.py
Program Metadata table
The media perimeter is defined here : "quotaclimat/data_processing/mediatree/channel_program_data.py"
To evolve the media perimeter, we use the program_grid_start and program_grid_end columns to version all evolutions.
To calculate the right total duration for each channel, after updating "quotaclimat/data_processing/mediatree/channel_program_data.py" you need to execute this command to update postgres/program_metadata.json:
poetry run python3 transform_program.py
The SQL queries are based on this file, which generates the Program Metadata table.
To avoid concurrent lock issues, program data is not updated when using UPDATE=true for the keywords logic. Note: the default case does update it.
The docker-entrypoint.sh runs this command automatically, so for production use you will not have to run it yourself.
Mediatree to S3
As a safety net, we have configured a data pipeline from the Mediatree API to S3 (Scaleway Object Storage).
Env variable used :
- START_DATE (integer) (unixtimestamp such as mediatree service)
- NUMBER_OF_PREVIOUS_DAYS (integer): default 7 days to check if something missing
- CHANNEL: (such as mediatree service)
- BUCKET : Scaleway Access key
- BUCKET_SECRET : Scaleway Secret key
- BUCKET_NAME
Stop words
To prevent advertising keywords from inflating the statistics, we remove stop words based on the number of times a keyword is said in the same context.
The result is saved in the PostgreSQL table stop_word.
This table is read by the "mediatree" service to remove stop words from the "plaintext" field so that they are not counted.
Env variables used :
- START_DATE (integer) (unixtimestamp such as mediatree service)
- NUMBER_OF_PREVIOUS_DAYS (integer): default 7 days
- MIN_REPETITION (integer) : default 15 - Number of minimum repetition of a stop word
- CONTEXT_TOTAL_LENGTH (integer) : default 80 - the length of the advertising context (sentence) saved
- FILTER_DAYS_STOP_WORD (integer): default 30 - number of days to filter the last stop words saved from - to speed up update execution
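The idea behind MIN_REPETITION and CONTEXT_TOTAL_LENGTH can be sketched as follows. This is a simplified illustration under stated assumptions (context taken around the first occurrence of the keyword, exact-match counting), with hypothetical function names; the real detection logic may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

MIN_REPETITION = 15        # default from the list above
CONTEXT_TOTAL_LENGTH = 80  # default from the list above


def extract_context(plaintext: str, keyword: str,
                    total_length: int = CONTEXT_TOTAL_LENGTH) -> Optional[str]:
    """Return up to `total_length` characters centred on the first keyword occurrence."""
    position = plaintext.find(keyword)
    if position == -1:
        return None
    margin = max(total_length - len(keyword), 0) // 2
    start = max(position - margin, 0)
    end = position + len(keyword) + margin
    return plaintext[start:end]


def find_stop_word_contexts(plaintexts: Iterable[str], keyword: str,
                            min_repetition: int = MIN_REPETITION) -> Dict[str, int]:
    """Return the contexts repeated at least `min_repetition` times, with their counts."""
    contexts = (extract_context(text, keyword) for text in plaintexts)
    counts = Counter(context for context in contexts if context is not None)
    return {context: n for context, n in counts.items() if n >= min_repetition}


# Hypothetical data: the same advert broadcast 15 times, plus one news sentence.
advert = "profitez de notre offre speciale climat aujourd'hui"
texts = [advert] * 15 + ["le climat change rapidement"]
repeated = find_stop_word_contexts(texts, "climat")
```

Here only the advert context reaches the threshold, so it would be the one flagged as a stop word.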
Remove a stop word
To remove a false positive, we set the validated attribute to false:
docker exec -ti quotaclimat-postgres_db-1 bash # or docker compose exec postgres_db bash
psql -h localhost --port 5432 -d barometre -U user
--> enter password : password
UPDATE stop_word set validated=false WHERE id = 'MY_ID';
Production monitoring
- Use scaleway
- Use [Ray dashboard] on port 8265
Bump version
poetry version minor
Materialized view - dbt
We use dbt, run via Docker:
docker compose up testconsole -d
docker compose exec testconsole bash
> dbt debug # check if this works
> dbt run
We can turn slow queries into materialized views to make them efficient.
To update our materialized view monthly in production, we have to use this command, which is run on every deployment of api-import (daily):
poetry run dbt run
Fix linting
Before committing, make sure that the lines of code you wrote conform to the PEP8 standard by running:
poetry run black .
poetry run isort .
poetry run flake8 .
There is some debt regarding the cleanliness of the code right now. Let's just not make it worse for now.
Thanks
Owner metadata
- Name: Data For Good France
- Login: dataforgoodfr
- Email: [email protected]
- Kind: organization
- Description:
- Website: http://www.dataforgood.fr
- Location: France
- Twitter:
- Company:
- Icon url: https://avatars.githubusercontent.com/u/11797105?v=4
- Repositories: 119
- Last synced at: 2024-04-24T05:37:38.645Z
- Profile URL: https://github.com/dataforgoodfr
GitHub Events
Total
- Create event: 76
- Commit comment event: 155
- Issues event: 17
- Watch event: 5
- Delete event: 67
- Member event: 1
- Issue comment event: 23
- Push event: 273
- Pull request review comment event: 11
- Pull request review event: 26
- Pull request event: 148
Last Year
- Create event: 76
- Commit comment event: 155
- Issues event: 17
- Watch event: 5
- Delete event: 67
- Member event: 1
- Issue comment event: 23
- Push event: 273
- Pull request review comment event: 11
- Pull request review event: 26
- Pull request event: 148
Committers metadata
Last synced: 6 days ago
Total Commits: 1,526
Total Committers: 19
Avg Commits per committer: 80.316
Development Distribution Score (DDS): 0.511
Commits in past year: 346
Committers in past year: 3
Avg Commits per committer in past year: 115.333
Development Distribution Score (DDS) in past year: 0.506
Name | Email | Commits |
---|---|---|
github-actions | g****s@g****m | 746 |
Paul Leclercq | p****q@g****m | 237 |
barometre-github-actions | b****s@g****m | 232 |
Rambier Estelle | 3****r | 232 |
Theo Alves Da Costa | t****a@e****m | 15 |
[email protected] | e****n@h****r | 12 |
Beef | b****r@g****m | 11 |
Bastien Gauthier | b****r@n****m | 10 |
dependabot[bot] | 4****] | 6 |
AwaSacko | 8****o | 5 |
greg-lep | g****t@g****m | 5 |
TheoSchwartz | t****z@o****r | 4 |
Thibault Jauneau | t****u@h****u | 3 |
ArnaudWald | a****d@g****m | 2 |
Sebastien Bourgeois | s****0@g****m | 2 |
Hanane | 1****i | 1 |
Thibault Jauneau | t****u@a****o | 1 |
Rémi | 1****s | 1 |
mikaml | m****t@g****m | 1 |
Committer domains:
- github.com: 2
- aircall.io: 1
- hec.edu: 1
- orange.fr: 1
- ntymail.com: 1
- ekimetrics.com: 1
Issue and Pull Request metadata
Last synced: 1 day ago
Total issues: 44
Total pull requests: 387
Average time to close issues: about 2 months
Average time to close pull requests: 11 days
Total issue authors: 2
Total pull request authors: 16
Average comments per issue: 0.23
Average comments per pull request: 0.47
Merged pull request: 246
Bot issues: 0
Bot pull requests: 124
Past year issues: 28
Past year pull requests: 272
Past year average time to close issues: about 1 month
Past year average time to close pull requests: 6 days
Past year issue authors: 2
Past year pull request authors: 4
Past year average comments per issue: 0.14
Past year average comments per pull request: 0.35
Past year merged pull request: 160
Past year bot issues: 0
Past year bot pull requests: 106
Top Issue Authors
- polomarcus (43)
- RDiPiazza (1)
Top Pull Request Authors
- polomarcus (180)
- dependabot[bot] (124)
- estellerambier (59)
- BastienGauthier (7)
- RDiPiazza (3)
- TheoSchwartz (2)
- elise-chin (2)
- HananeMaghlazi (2)
- RR-DataSciences (1)
- SprinTech (1)
- TheoLvs (1)
- thibault-jauneau (1)
- sebastienbourgeois (1)
- greg-lep (1)
- Thenewnative (1)
Top Issue Labels
- enhancement (11)
- bug (3)
- help wanted (2)
- documentation (2)
- wontfix (1)
- good first issue (1)
Top Pull Request Labels
- dependencies (124)
- wontfix (8)
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- snok/install-poetry v1 composite
Score: 6.728628613084702